# Environment Setup

In [1]:
import os
from dotenv import load_dotenv
load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
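
If the `.env` file is missing or the variable is misnamed, `os.getenv` quietly returns `None`, and the problem only surfaces later when the first OpenAI call is made. A minimal sketch of a fail-fast check follows; the helper name `require_env` is hypothetical and not part of the original notebook.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; check your .env file."
        )
    return value
```

After `load_dotenv()`, the key could then be read with `OPENAI_API_KEY = require_env("OPENAI_API_KEY")`.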

# Pipeline 1 - Embedding

### Step 1. Loading

In this step, we load data from various sources and prepare it for ingestion.

In [2]:
!pip install -q -U arxiv


In [2]:
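# Loaders now live in langchain_community; the old langchain.document_loaders path is deprecated
from langchain_community.document_loaders import ArxivLoader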

base_docs = ArxivLoader(query="Retrieval Augmented Generation", load_max_docs=5).load()

In [3]:
for doc in base_docs:
  print(doc.metadata)

{'Published': '2024-06-19', 'Title': 'R^2AG: Incorporating Retrieval Information into Retrieval Augmented Generation', 'Authors': 'Fuda Ye, Shuangyin Li, Yongqi Zhang, Lei Chen', 'Summary': "Retrieval augmented generation (RAG) has been applied in many scenarios to\naugment large language models (LLMs) with external documents provided by\nretrievers. However, a semantic gap exists between LLMs and retrievers due to\ndifferences in their training objectives and architectures. This misalignment\nforces LLMs to passively accept the documents provided by the retrievers,\nleading to incomprehension in the generation process, where the LLMs are\nburdened with the task of distinguishing these documents using their inherent\nknowledge. This paper proposes R$^2$AG, a novel enhanced RAG framework to fill\nthis gap by incorporating Retrieval information into Retrieval Augmented\nGeneration. Specifically, R$^2$AG utilizes the nuanced features from the\nretrievers and employs a R$^2$-Former to capt

### Step 2. Parsing

##### Type 1. text document

In [3]:
from langchain_community.document_loaders import TextLoader

In [None]:
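# DOCUMENT is the directory that holds the source files. It is not defined anywhere
# in this notebook, so reading it from an environment variable is an assumption.
DOCUMENT = os.getenv("DOCUMENT", "./")
txt_path = DOCUMENT + "rag.txt"
txt_loader = TextLoader(txt_path)
text_documents = txt_loader.load()
#text_documents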

##### Type 2. PDF document

This experiment uses `PyMuPDFLoader` to parse the PDF.

In [6]:
from langchain_community.document_loaders import PyMuPDFLoader
pdf_path = DOCUMENT+ "2005.11401v4.pdf"
pdf_loader = PyMuPDFLoader(pdf_path)
pdf_documents = pdf_loader.load()

### Step 3. Chunking

Chunk the text file

In [7]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=250, chunk_overlap=20)
text_chunks = text_splitter.split_documents(text_documents)
#text_chunks[:3]

Chunk the PDF file

In [8]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
pdf_chunks = text_splitter.split_documents(pdf_documents)

Chunk the online documents

In [4]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

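# chunk_overlap is not set here, so the splitter's default applies
# (200 characters in LangChain releases at the time of writing, which is large relative to chunk_size=250)
text_splitter = RecursiveCharacterTextSplitter(chunk_size=250)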
doc_chunks = text_splitter.split_documents(base_docs)

In [9]:
chunks = text_chunks + pdf_chunks
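
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window splitter. It is only an illustration under stated assumptions: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraphs and newlines, which this sketch ignores, and the function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows that share chunk_overlap characters.

    Simplified illustration only; not the algorithm LangChain uses.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
print(sliding_chunks(sample, chunk_size=10, chunk_overlap=2))
# ['abcdefghij', 'ijklmnopqr', 'qrstuvwxyz']
```

Each chunk starts `chunk_size - chunk_overlap` characters after the previous one, so a larger overlap produces more chunks that repeat more of their neighbours' text.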

### Step 4. Vectorizing

Option 1: Using the OpenAI embedding API

In [6]:
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import DocArrayInMemorySearch

In [7]:
embeddings = OpenAIEmbeddings()
# vectorstore = DocArrayInMemorySearch.from_documents(chunks, embeddings)

Option 2: 

### Step 5. Storing

Persist the vector store to disk with Chroma.

In [8]:
from langchain_community.vectorstores import Chroma
persist_directory = os.getenv("ARXIVSTORE")
vectordb = Chroma.from_documents(documents=doc_chunks,  embedding=embeddings, persist_directory=persist_directory)
vectordb.persist()
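
Running `Chroma.from_documents` again with the same `persist_directory` can add the same chunks a second time, which would show up as repeated passages in retrieval results; whether that happened in this notebook is an assumption, not something the source confirms. One way to guard against it is to derive a stable ID from each chunk's text and drop repeats before storing. The sketch below uses plain strings in place of `Document` objects, and `content_id` and `dedupe_chunks` are hypothetical helper names.

```python
import hashlib


def content_id(text: str) -> str:
    """Derive a stable ID from the chunk text (SHA-256 hex digest)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def dedupe_chunks(texts: list[str]) -> tuple[list[str], list[str]]:
    """Return (ids, unique_texts), keeping the first occurrence of each text."""
    seen = set()
    ids, unique = [], []
    for text in texts:
        cid = content_id(text)
        if cid in seen:
            continue  # identical content already kept
        seen.add(cid)
        ids.append(cid)
        unique.append(text)
    return ids, unique


ids, unique = dedupe_chunks(["chunk a", "chunk b", "chunk a"])
print(unique)  # ['chunk a', 'chunk b']
```

The resulting IDs could then be supplied through the `ids` argument of `Chroma.from_documents`, assuming the installed version accepts it, so that re-running the cell does not create new entries for unchanged chunks.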


# Pipeline 2 - Retrieving

### Step 1. Query

In [9]:
user_query = "What is retrieval augmented generation"
#user_query = "Describe the RAG-Sequence Model?"

### Step 2. Search

Load the vector store from the persisted Chroma directory if one exists; the in-memory vector store remains only as a commented-out alternative.
Search efficiency can be improved as the knowledge base grows larger and draws on more types of sources.

In [10]:
import os
from dotenv import load_dotenv
load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

In [18]:
from langchain_openai.embeddings import OpenAIEmbeddings

In [11]:
#retriever = vectorstore.as_retriever()

#Load vectordb from persisted store
from langchain_community.vectorstores import Chroma
persist_directory = os.getenv("ARXIVSTORE")
embeddings = OpenAIEmbeddings()
newvectordb = Chroma(persist_directory=persist_directory, embedding_function=embeddings)
retriever = newvectordb.as_retriever()
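
Conceptually, the retriever embeds the query and returns the stored chunks whose vectors lie closest to it. The following sketch shows that idea with a brute-force cosine-similarity search over hand-made toy vectors. It is not how Chroma works internally (real vector stores typically rely on an approximate nearest-neighbour index), and the vectors, texts, and the `top_k` function are invented for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], store: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """Return the k texts whose vectors are most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy "embeddings": three dimensions chosen by hand, not produced by a model
toy_store = [
    ("RAG combines retrieval with generation", [0.9, 0.1, 0.0]),
    ("Chunking splits long documents", [0.1, 0.9, 0.0]),
    ("Bananas are yellow", [0.0, 0.1, 0.9]),
]

print(top_k([1.0, 0.2, 0.0], toy_store, k=2))
# ['RAG combines retrieval with generation', 'Chunking splits long documents']
```

Brute-force scoring compares the query against every stored vector, which is why search cost grows with the size of the knowledge base.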

### Step 3. Augmented Prompt

In [12]:
from langchain.prompts import ChatPromptTemplate

template = """
Answer the question based on the context below. 
If you can't answer the question, reply "I don't know".

Context: {context}

Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)

In [13]:
from langchain_core.runnables import RunnableParallel, RunnablePassthrough
setup = RunnableParallel(context=retriever, question=RunnablePassthrough())
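
In this setup the retriever's output is handed to the prompt as `{context}` unchanged, so the model receives the string form of the whole list of `Document` objects, metadata included. A common alternative is to join only the page text before filling the template. The sketch below shows that step in plain Python; `FakeDoc` stands in for LangChain's `Document`, and `format_docs` and `build_prompt` are hypothetical names.

```python
from dataclasses import dataclass

# Same wording as the notebook's prompt template
TEMPLATE = """
Answer the question based on the context below.
If you can't answer the question, reply "I don't know".

Context: {context}

Question: {question}
"""


@dataclass
class FakeDoc:
    """Stand-in for a retrieved document; only page_content is modelled."""
    page_content: str


def format_docs(docs) -> str:
    """Join the text of the retrieved documents, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)


def build_prompt(question: str, docs) -> str:
    """Fill the template with the joined context and the user question."""
    return TEMPLATE.format(context=format_docs(docs), question=question)


docs = [
    FakeDoc("RAG augments LLMs with retrieved documents."),
    FakeDoc("Retrievers supply external context."),
]
print(build_prompt("What is retrieval augmented generation", docs))
```

In the chain itself, a function like `format_docs` could be piped after the retriever, for example `context=retriever | format_docs`, so that only the page text reaches the prompt.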

### Step 4. Response Generating

Option 1: Using the cloud-hosted OpenAI API

In [14]:
from langchain_openai.chat_models import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser

model = ChatOpenAI(openai_api_key=OPENAI_API_KEY, model="gpt-3.5-turbo")
parser = StrOutputParser()

In [15]:
chain = setup | prompt | model | parser

In [16]:
response = chain.invoke(user_query)
response

'Retrieval-augmented generation is a text generation approach that involves using retrieval sources, retrieval metrics, and generation models to enhance the generation process. It has shown remarkable advantages and achieved state-of-the-art performance in many natural language processing tasks.'

Test the retriever

In [17]:
test_retrieval = retriever.invoke(user_query)
test_retrieval

[Document(metadata={'Authors': 'Huayang Li, Yixuan Su, Deng Cai, Yan Wang, Lemao Liu', 'Published': '2022-02-13', 'Summary': 'Recently, retrieval-augmented text generation attracted increasing attention\nof the computational linguistics community. Compared with conventional\ngeneration models, retrieval-augmented text generation has remarkable\nadvantages and particularly has achieved state-of-the-art performance in many\nNLP tasks. This paper aims to conduct a survey about retrieval-augmented text\ngeneration. It firstly highlights the generic paradigm of retrieval-augmented\ngeneration, and then it reviews notable approaches according to different tasks\nincluding dialogue response generation, machine translation, and other\ngeneration tasks. Finally, it points out some important directions on top of\nrecent methods to facilitate future research.', 'Title': 'A Survey on Retrieval-Augmented Text Generation'}, page_content='et al., 2021b), and knowledge-intensive generation\n(Lewis et 

Run a bot

In [34]:
while True:
    user_input = input("Enter a query: ")
    # Type "exit" to leave the loop
    if user_input == "exit":
        break

    try:
        response = chain.invoke(user_input)
        print(response)
    except Exception as err:
        # Report the error and keep the loop running
        print('Exception occurred. Please try again', str(err))

RAG stands for Retrieval Augmented Generation, which is a framework that combines large language models (LLMs) with external documents provided by retrievers to improve performance in various tasks.
To implement RAG (Retrieval Augmented Generation), you need to focus on enhancing tool retrieval, which can lead to improvements in plan generation. Additionally, you can experiment with different methods such as CoT, RECOMP, CRAG, Self-RAG, LongLLMLingua, and R^2AG to enhance the RAG framework.
To evaluate RAG application, one can compare different methods such as standard RAG using various LLMs and enhanced RAG using the same foundation LLM. Additionally, one can evaluate standard RAG baselines where LLMs generate responses given the query prepended with retrieved documents. Experiments across multiple datasets can be conducted to validate the effectiveness, robustness, and efficiency of RAG applications.
I don't know.


# RAG Evaluation

In [35]:
!pip install -q -U ragas


In [31]:
!pip install -q -U tqdm


### Generate Synthetic Test Dataset

In [36]:
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# documents = load your documents

# generator with openai models
generator_llm = ChatOpenAI(model="gpt-4-1106-preview", temperature=0) 
critic_llm = ChatOpenAI(model="gpt-4")
embeddings = OpenAIEmbeddings()

In [41]:
import nest_asyncio
nest_asyncio.apply()

In [44]:
generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embeddings
)

# Change resulting question type distribution
distributions = {
    simple: 0.5,
    multi_context: 0.4,
    reasoning: 0.1
}

try:
    testset = generator.generate_with_langchain_docs(base_docs, test_size=5, distributions=distributions)
except Exception as e:
    print(e)

# use generator.generate_with_llamaindex_docs if you use llama-index as document loader

Filename and doc_id are the same for all nodes.                   
Generating: 100%|██████████| 5/5 [01:11<00:00, 14.28s/it]
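
The `distributions` mapping assigns each question type a share of the generated test set, and the shares are expected to sum to 1. A small sketch of that arithmetic follows, using plain strings in place of the ragas evolution objects. `expected_counts` is a hypothetical helper, and the number of questions per type that ragas actually produces may differ from these idealized figures.

```python
import math


def expected_counts(distribution: dict[str, float], test_size: int) -> dict[str, float]:
    """Return the idealized number of questions per type for a given test size."""
    total = sum(distribution.values())
    if not math.isclose(total, 1.0):
        raise ValueError(f"Shares must sum to 1, got {total}")
    return {name: share * test_size for name, share in distribution.items()}


# Same shares as the notebook, keyed by name instead of by evolution object
shares = {"simple": 0.5, "multi_context": 0.4, "reasoning": 0.1}
print(expected_counts(shares, test_size=5))
```

With `test_size=5` the idealized counts are fractional for two of the three types, so the generator has to round, and small test sets will not match the requested shares exactly.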


Simpler test set generator. In this run the call raised the exception shown in the traceback below.

In [42]:
simple_generator = TestsetGenerator.with_openai()

testset = simple_generator.generate_with_langchain_docs(doc_chunks, test_size=10, distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25})

  simple_generator = TestsetGenerator.with_openai()
Exception in thread Thread-79:                                      
Traceback (most recent call last):
  File "C:\Users\derek\AppData\Local\Programs\Python\Python312\Lib\threading.py", line 1073, in _bootstrap_inner
    self.run()
  File "c:\Users\derek\OneDrive\1 - Technology\Workspace\rag_win\Lib\site-packages\ragas\executor.py", line 87, in run
    results = self.loop.run_until_complete(self._aresults())
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\derek\OneDrive\1 - Technology\Workspace\rag_win\Lib\site-packages\nest_asyncio.py", line 98, in run_until_complete
    return f.result()
           ^^^^^^^^^^
  File "C:\Users\derek\AppData\Local\Programs\Python\Python312\Lib\asyncio\futures.py", line 203, in result
    raise self._exception.with_traceback(self._exception_tb)
  File "C:\Users\derek\AppData\Local\Programs\Python\Python312\Lib\asyncio\tasks.py", line 314, in __step_run_and_handle_result
  

ExceptionInRunner: The runner thread which was running the jobs raised an exeception. Read the traceback above to debug it. You can also pass `raise_exceptions=False` incase you want to show only a warning message instead.

  self._tasks: set[asyncio.Task] = set()


In [46]:
testset.to_pandas()

| | question | contexts | ground_truth | evolution_type | metadata | episode_done |
|---|---|---|---|---|---|---|
| 0 | What challenges do large language models (LLMs... | [Corrective Retrieval Augmented Generation\nSh... | Large language models (LLMs) face challenges s... | simple | [{'Published': '2024-02-16', 'Title': 'Correct... | True |
| 1 | How does incorporating relevant context in pla... | [etuned Semantic\nSearch\n73.48\n88.52\n95.13\... | Incorporating relevant context in plan generat... | simple | [{'Published': '2023-12-09', 'Title': 'Context... | True |
| 2 | How does the retrieval evaluator in CRAG impro... | [Corrective Retrieval Augmented Generation\nSh... | The retrieval evaluator in CRAG assesses the o... | multi_context | [{'Published': '2024-02-16', 'Title': 'Correct... | True |
| 3 | What LLM with a 32k token limit powers DuReade... | [\n0.0265\n0.0830\n0.0156\n0.2666\n0.0329\nCRA... | The foundation LLM for DuReader's improved RAG... | multi_context | [{'Published': '2024-06-19', 'Title': 'R^2AG: ... | True |
| 4 | What foundation LLM is used for the DuReader d... | [\n0.0265\n0.0830\n0.0156\n0.2666\n0.0329\nCRA... | The foundation LLM used for the DuReader datas... | simple | [{'Published': '2024-06-19', 'Title': 'R^2AG: ... | True |


### Run evaluation on our RAG chain

In [47]:
# Convert once and reuse the DataFrame for both columns
test_df = testset.to_pandas()
questions = test_df["question"].to_list()
ground_truth = test_df["ground_truth"].to_list()

In [48]:
questions

['What challenges do large language models (LLMs) face that Corrective Retrieval Augmented Generation (CRAG) aims to address?',
 'How does incorporating relevant context in plan generation help reduce hallucination in the context-tuned planner?',
 'How does the retrieval evaluator in CRAG improve doc use for targeted knowledge creation?',
 "What LLM with a 32k token limit powers DuReader's improved RAG, and how does its F1 score match up?",
 'What foundation LLM is used for the DuReader dataset, and how does its performance compare with other methods?']

In [49]:
ground_truth

['Large language models (LLMs) face challenges such as hallucinations, factual errors, and the inability to secure the accuracy of generated texts solely by the parametric knowledge they encapsulate. Corrective Retrieval Augmented Generation (CRAG) aims to address these challenges by improving the robustness of generation through a lightweight retrieval evaluator, large-scale web searches, and a decompose-then-recompose algorithm for retrieved documents.',
 'Incorporating relevant context in plan generation helps reduce hallucination in the context-tuned planner, as evidenced by the upper bound, which effectively employs oracle retrievers.',
 'The retrieval evaluator in CRAG assesses the overall quality of retrieved documents for a query, returning a confidence degree based on which different knowledge retrieval actions can be triggered. This allows for selective focus on key information and filtering out irrelevant information in the retrieved documents, thereby improving the utilizat

In [50]:
from datasets import Dataset

data = {"question": [], "answer": [], "contexts": [], "ground_truth": ground_truth}

for query in questions:
    data["question"].append(query)
    # Answer generated by the RAG chain
    data["answer"].append(chain.invoke(query))
    # Retrieved chunks for the same query; invoke() replaces the deprecated get_relevant_documents()
    data["contexts"].append([doc.page_content for doc in retriever.invoke(query)])

dataset = Dataset.from_dict(data)
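
`Dataset.from_dict` needs every column to have the same number of rows. Here `ground_truth` is filled up front while the other three columns are appended inside the loop, so it can be worth confirming that they line up before building the dataset. A minimal sketch of such a pre-check; `check_columns` is a hypothetical helper that works on any dict of lists.

```python
def check_columns(data: dict[str, list]) -> int:
    """Return the shared row count, or raise if the columns differ in length."""
    lengths = {name: len(values) for name, values in data.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    # An empty dict has no columns and therefore zero rows
    return next(iter(lengths.values()), 0)


example = {
    "question": ["q1", "q2"],
    "answer": ["a1", "a2"],
    "contexts": [["c1"], ["c2"]],
    "ground_truth": ["g1", "g2"],
}
print(check_columns(example))  # 2
```

Calling `check_columns(data)` just before `Dataset.from_dict(data)` turns a length mismatch into an error message that names the offending columns.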


In [65]:
retriever.invoke(questions[1])

[Document(metadata={'Authors': 'Raviteja Anantha, Tharun Bethi, Danil Vodianik, Srinivas Chappidi', 'Published': '2023-12-09', 'Summary': "Large language models (LLMs) have the remarkable ability to solve new tasks\nwith just a few examples, but they need access to the right tools. Retrieval\nAugmented Generation (RAG) addresses this problem by retrieving a list of\nrelevant tools for a given task. However, RAG's tool retrieval step requires\nall the required information to be explicitly present in the query. This is a\nlimitation, as semantic search, the widely adopted tool retrieval method, can\nfail when the query is incomplete or lacks context. To address this limitation,\nwe propose Context Tuning for RAG, which employs a smart context retrieval\nsystem to fetch relevant information that improves both tool retrieval and plan\ngeneration. Our lightweight context retrieval model uses numerical,\ncategorical, and habitual usage signals to retrieve and rank context items. Our\nempiric

In [51]:
dataset

Dataset({
    features: ['question', 'answer', 'contexts', 'ground_truth'],
    num_rows: 5
})

In [52]:
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
)

result = evaluate(
    dataset = dataset,
    metrics=[
        context_precision,
        context_recall,
        faithfulness,
        answer_relevancy,
    ],
)

Evaluating: 100%|██████████| 20/20 [00:09<00:00,  2.17it/s]
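
As a rough guide to reading the scores: ragas describes faithfulness as the fraction of claims in the answer that can be inferred from the retrieved context. The sketch below shows only that final ratio with hand-labelled verdicts. In ragas the claims are extracted and judged by an LLM, and the verdict list here is invented for illustration; `faithfulness_ratio` is a hypothetical name.

```python
def faithfulness_ratio(verdicts: list[bool]) -> float:
    """Fraction of answer claims marked as supported by the context.

    Each entry is True if a claim is supported and False otherwise.
    """
    if not verdicts:
        raise ValueError("At least one claim is needed to compute a ratio")
    return sum(verdicts) / len(verdicts)


# Invented example: one of four claims is supported by the retrieved context
print(faithfulness_ratio([True, False, False, False]))  # 0.25
```

A score of 0.0 therefore means that none of the answer's claims were judged to be supported by the retrieved chunks, which can point at either the retrieval step or the judging step.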


In [61]:
import pandas as pd
result_pd = result.to_pandas()
pd.set_option("display.max_colwidth", 700)
result_pd[["question", "contexts", "answer", "ground_truth","faithfulness"]]

Unnamed: 0,question,contexts,answer,ground_truth,faithfulness
0,What challenges do large language models (LLMs) face that Corrective Retrieval Augmented Generation (CRAG) aims to address?,"[show that CRAG can significantly improve the\nperformance of RAG-based approaches.1\n1\nIntroduction\nLarge language models (LLMs) have attracted\nincreasing attention and exhibited impressive abili-\nties to understand instructions and generate fluent, show that CRAG can significantly improve the\nperformance of RAG-based approaches.1\n1\nIntroduction\nLarge language models (LLMs) have attracted\nincreasing attention and exhibited impressive abili-\nties to understand instructions and generate fluent, covering short- and long-form generation tasks\nshow that CRAG can significantly improve the\nperformance of RAG-based approaches.1\n1\nIntroduction\nLarge language models (LLMs) have att...","Large language models (LLMs) face challenges of hallucinations, where the accuracy of generated texts cannot be guaranteed solely by the parametric knowledge they encapsulate. Corrective Retrieval Augmented Generation (CRAG) aims to improve the robustness of generation by addressing concerns about the relevance of retrieved documents and potential issues if retrieval goes wrong.","Large language models (LLMs) face challenges such as hallucinations, factual errors, and the inability to secure the accuracy of generated texts solely by the parametric knowledge they encapsulate. Corrective Retrieval Augmented Generation (CRAG) aims to address these challenges by improving the robustness of generation through a lightweight retrieval evaluator, large-scale web searches, and a decompose-then-recompose algorithm for retrieved documents.",0.0
1,How does incorporating relevant context in plan generation help reduce hallucination in the context-tuned planner?,"[5. We show that context augmentation at plan\ngeneration reduces hallucinations.\n2\nRelated Work\nUsing retrieval to incorporate tools into plan gen-\neration with LLMs has emerged as a burgeoning\narea of research, with ongoing investigations aimed, 5. We show that context augmentation at plan\ngeneration reduces hallucinations.\n2\nRelated Work\nUsing retrieval to incorporate tools into plan gen-\neration with LLMs has emerged as a burgeoning\narea of research, with ongoing investigations aimed, denced by the upper bound, helps in reducing hal-\nlucination.\n5\nConclusion\nOur work introduces context tuning, a novel compo-\nnent that enhances RAG-based planning by equip-\nping it wit...","Incorporating relevant context in plan generation helps reduce hallucination in the context-tuned planner by providing essential information and signals that guide the generation process, leading to more accurate and informed decisions.","Incorporating relevant context in plan generation helps reduce hallucination in the context-tuned planner, as evidenced by the upper bound, which effectively employs oracle retrievers.",0.25
2,How does the retrieval evaluator in CRAG improve doc use for targeted knowledge creation?,"[retrieval evaluator is to estimate and trigger three\nknowledge retrieval actions discriminately. With\nthe further leverage of web search and optimized\nknowledge utilization, CRAG has significantly im-, retrieval evaluator is to estimate and trigger three\nknowledge retrieval actions discriminately. With\nthe further leverage of web search and optimized\nknowledge utilization, CRAG has significantly im-, heavily on the relevance of retrieved docu-\nments, raising concerns about how the model\nbehaves if retrieval goes wrong. To this end, we\npropose the Corrective Retrieval Augmented\nGeneration (CRAG) to improve the robustness\nof generation., heavily on the relevance of retrieved do...","The retrieval evaluator in CRAG assesses the overall quality of retrieved documents for a query and returns a confidence degree based on which different knowledge retrieval actions can be triggered. This helps in selectively focusing on key information and filtering out irrelevant information in the retrieved documents, thus improving the use of documents for targeted knowledge creation.","The retrieval evaluator in CRAG assesses the overall quality of retrieved documents for a query, returning a confidence degree based on which different knowledge retrieval actions can be triggered. This allows for selective focus on key information and filtering out irrelevant information in the retrieved documents, thereby improving the utilization of documents for targeted knowledge creation.",0.0
3,"What LLM with a 32k token limit powers DuReader's improved RAG, and how does its F1 score match up?","[LLMs’ ability for complex reasoning. In DuReader\ndataset, with a token length of 16k, R2AG remains\neffective, demonstrating its robustness and effi-\nciency in handling extensive text outputs. These re-, LLMs’ ability for complex reasoning. In DuReader\ndataset, with a token length of 16k, R2AG remains\neffective, demonstrating its robustness and effi-\nciency in handling extensive text outputs. These re-, (1) Compared with foundation LLMs using stan-\ndard RAG, R2AG can significantly increase perfor-\nmance. Even in multi-hot datasets, R2AG improves\nLLMs’ ability for complex reasoning. In DuReader\ndataset, with a token length of 16k, R2AG remains, (1) Compared with foundation LLMs ...",I don't know.,"The foundation LLM for DuReader's improved RAG is Qwen1.50.5B with a 32k token limit, and its F1 score is 0.1395.",0.0
4,"What foundation LLM is used for the DuReader dataset, and how does its performance compare with other methods?","[Table 2: Performance comparison on DuReader dataset.\nas the foundation LLM for enhanced RAG methods,\nwhich has a maximum context length of 4k tokens.\nFor NQ-20 and NQ-30 datasets, LongChat1.57B\nis selected as the foundation LLM, which extends, Table 2: Performance comparison on DuReader dataset.\nas the foundation LLM for enhanced RAG methods,\nwhich has a maximum context length of 4k tokens.\nFor NQ-20 and NQ-30 datasets, LongChat1.57B\nis selected as the foundation LLM, which extends, datasets. For DuReader dataset, we measure per-\nformance by F1 score and Rouge (Lin, 2004).\n4.2\nBaselines\nTo fully evaluate R2AG, we compared two types of\nmethods: standard RAG using various LLM...","The foundation LLM used for the DuReader dataset is LongChat1.57B, and its performance is compared with other methods in terms of F1 score and Rouge (Lin, 2004).","The foundation LLM used for the DuReader dataset is Qwen1.50.5B, with a maximum context length of 32k tokens. It is categorized under frozen LLMs. In terms of performance, RAFT, a fine-tuned LLM, has an F1 score of 0.2423 and a Rouge score of 0.2740, while R2AG+RAFT, another fine-tuned LLM, shows slightly better performance with an F1 score of 0.2507 and a Rouge score of 0.2734. These results indicate that fine-tuned LLMs outperform the foundation LLM Qwen1.50.5B on the DuReader dataset.",0.0
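
To summarize the table, the per-question faithfulness values can be averaged and the weakest rows listed. The list below copies the five scores shown above; in the notebook the same values would come from the `faithfulness` column of `result_pd`. The variable names are illustrative.

```python
from statistics import fmean

# Faithfulness scores copied from the five rows of the result table
faithfulness_scores = [0.0, 0.25, 0.0, 0.0, 0.0]

mean_faithfulness = fmean(faithfulness_scores)
# Row indices whose answers had no claim judged as supported
zero_rows = [i for i, score in enumerate(faithfulness_scores) if score == 0.0]

print(round(mean_faithfulness, 3))  # 0.05
print(zero_rows)                    # [0, 2, 3, 4]
```

With only five questions the average is a coarse signal; inspecting the zero-score rows individually is likely to be more informative than the mean.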
