# Build a RAG system on arXiv papers with source references

When building a RAG system it may be important to show users the sources that were used to generate the answer.  

In this quick tutorial, we'll build a RAG system that uses arXiv papers as the source of context for an LLM. Each response will include links to the arXiv papers the context was taken from.

arXiv papers are PDFs, which is not a problem because we'll use [Unstructured.io](https://unstructured.io/) for document preprocessing. We'll build the RAG pipeline with LangChain, which offers a simple way to return the LangChain Documents retrieved for each generation.

And because Unstructured enriches extracted text with metadata, we can use the Documents' metadata to build links back to the papers.

Let's go!


## Setup

* Install the required libraries.
* Get an [Unstructured API key](https://unstructured.io/api-key-free). The free tier will work (capped at 1000 pages).
* Get your Hugging Face token (depending on the model you choose, you may not need it). You can create one in your [profile's settings](https://huggingface.co/settings/tokens).

In [None]:
!pip install -q unstructured-client unstructured[pdf] langchain chromadb huggingface_hub sentence-transformers arxiv langchain_community bitsandbytes accelerate

In [None]:
from unstructured_client import UnstructuredClient
from unstructured_client.models import operations, shared
from unstructured_client.models.errors import SDKError
import arxiv
import tqdm
import glob
from unstructured.staging.base import dict_to_elements
from unstructured.chunking.title import chunk_by_title
from typing import List

## Preprocessing papers from arXiv


In [None]:
import os

# Add your Unstructured API key here
os.environ["UNSTRUCTURED_API_KEY"] = ""

Let's define a function that fetches the specified number of papers matching a query from arXiv, and then sends them to the Unstructured API to extract document elements such as text, titles, lists, tables, footers, and so on.

In [None]:
def get_arxiv_paper_texts(query: str, max_results: int = 10) -> List[dict]:

    # Get the list of arXiv papers matching the given query using the arXiv API
    arxiv_client = arxiv.Client()

    search = arxiv.Search(
        query=query,
        max_results=max_results,
        sort_by=arxiv.SortCriterion.Relevance,
        sort_order=arxiv.SortOrder.Descending,
    )

    client = UnstructuredClient(api_key_auth=os.getenv("UNSTRUCTURED_API_KEY"))

    paper_texts = []
    # Loop through PDFs: download, pre-process, and then delete
    for paper in arxiv_client.results(search):
        # download_pdf() returns the path of the saved file, so we don't need
        # to guess it with glob (which could pick up an unrelated PDF)
        pdf_path = paper.download_pdf()
        with open(pdf_path, "rb") as file:
            req = shared.PartitionParameters(
                files=shared.Files(
                    content=file.read(),
                    file_name=os.path.basename(pdf_path),
                ),
                # hi_res strategy is the best choice for complex PDFs (e.g. with tables)
                # and for image-based files
                strategy="hi_res",
            )
        try:
            res = client.general.partition(req)
            if res.elements is not None:
                paper_texts += res.elements
        except SDKError as e:
            print(e)
        finally:
            # Always clean up the downloaded file, even if partitioning failed
            os.remove(pdf_path)
    return paper_texts

In this example, we fetch the top 10 papers that match the search query "RAG":

In [None]:
# Depending on the number of papers you process, this may take from a few seconds to minutes
elements = get_arxiv_paper_texts("RAG", 10)
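To get a feel for what the partitioning step returns, it can help to tally the element types. The sketch below uses a small hand-written sample rather than a real API response: the values (including the file name) are invented, and the dictionary layout with `type`, `text`, and `metadata` keys is an assumption about how the returned elements look. `count_element_types` is a hypothetical helper, not part of any library.

```python
from collections import Counter

# Hypothetical stand-in for `elements`; all values are invented for illustration.
sample_elements = [
    {"type": "Title", "text": "A Sample Paper",
     "metadata": {"filename": "0000.00000v1.Sample.pdf", "page_number": 1}},
    {"type": "NarrativeText", "text": "Retrieval grounds the answer in documents.",
     "metadata": {"filename": "0000.00000v1.Sample.pdf", "page_number": 1}},
    {"type": "ListItem", "text": "Index the corpus",
     "metadata": {"filename": "0000.00000v1.Sample.pdf", "page_number": 2}},
    {"type": "NarrativeText", "text": "Generation uses the retrieved text.",
     "metadata": {"filename": "0000.00000v1.Sample.pdf", "page_number": 2}},
]


def count_element_types(elements):
    """Count how many elements of each type were extracted."""
    return Counter(element["type"] for element in elements)


type_counts = count_element_types(sample_elements)
print(type_counts.most_common())
```

Running the same helper on the real `elements` list should give a quick overview of how much of each paper is narrative text, titles, tables, and so on.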

## Chunking preprocessed PDFs

If you are familiar with chunking methods that split long text documents into smaller chunks, you'll notice that Unstructured's approach differs slightly, since the partitioning step already divides an entire document into its structural elements: titles, list items, tables, text, etc. This helps to avoid a situation where unrelated pieces of text end up in the same chunk.

With Unstructured chunking, individual elements are only split if they exceed the desired maximum chunk size. You can also choose to combine consecutive text elements that together fit within `max_characters`.

In [None]:
staged_elements = dict_to_elements(elements)

In [None]:
chunked_elements = chunk_by_title(staged_elements,
                                  max_characters=512,
                                  # You can choose to combine consecutive elements that are too small
                                  # e.g. individual list items
                                  combine_text_under_n_chars=200,
                                  )
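The two rules described above (split only oversized elements, merge small neighbours while they still fit) can be illustrated with a toy, pure-Python chunker. This is a simplified sketch and not how `chunk_by_title` is implemented: it ignores section boundaries and titles entirely, and `simple_chunk` and its `combine_under` parameter are hypothetical names chosen for this illustration.

```python
from typing import List


def simple_chunk(texts: List[str], max_characters: int = 512,
                 combine_under: int = 200) -> List[str]:
    """Greedy toy chunker: merge small neighbours, hard-split oversized texts."""
    chunks: List[str] = []
    current = ""
    for text in texts:
        if len(text) > max_characters:
            # Oversized element: flush the pending chunk, then split by size
            if current:
                chunks.append(current)
                current = ""
            for start in range(0, len(text), max_characters):
                chunks.append(text[start:start + max_characters])
            continue
        candidate = f"{current}\n\n{text}" if current else text
        # Merge only while the pending chunk is small and the result still fits
        if current and len(current) < combine_under and len(candidate) <= max_characters:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = text
    if current:
        chunks.append(current)
    return chunks


toy_texts = ["Intro", "a" * 50, "b" * 600, "c" * 300]
toy_chunks = simple_chunk(toy_texts)
print([len(chunk) for chunk in toy_chunks])
```

In this toy run the two short texts end up in one chunk, the 600-character text is split because it exceeds the limit, and the 300-character text stays intact.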


## Creating ChromaDB retriever

In [None]:
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_community.embeddings import HuggingFaceEmbeddings

First, convert the chunked Unstructured elements into LangChain documents:

In [None]:
documents = []
for chunked_element in chunked_elements:
    metadata = chunked_element.metadata.to_dict()
    metadata["source"] = metadata["filename"]
    # "languages" is a list, which ChromaDB can't store; pop() avoids a
    # KeyError if an element has no such field
    metadata.pop("languages", None)
    documents.append(Document(page_content=chunked_element.text, metadata=metadata))

Next, choose your embedding model (make sure the chunk size you specified earlier fits in the embedding model's context window), then set up your vector store and a retriever based on it.

In [None]:
from langchain_community.vectorstores import utils as chromautils

# ChromaDB doesn't support complex metadata, e.g. lists, so we drop it here.
# If you're using a different vector store, you may not need to do this
docs = chromautils.filter_complex_metadata(documents)

embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-base-en-v1.5")
# Build the vector store from the filtered documents, not the original list
vectorstore = Chroma.from_documents(docs, embeddings)
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 6})
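As a rough illustration of what dropping "complex" metadata amounts to, the sketch below keeps only scalar values and discards lists and nested dictionaries. It is a pure-Python approximation for intuition, not LangChain's implementation; `keep_simple_metadata` is a hypothetical helper and the sample values are invented.

```python
def keep_simple_metadata(metadata: dict) -> dict:
    """Keep only scalar metadata values (strings, numbers, booleans)."""
    allowed_types = (str, int, float, bool)
    return {key: value for key, value in metadata.items()
            if isinstance(value, allowed_types)}


# Invented sample metadata, loosely shaped like an element's metadata dict
sample_metadata = {
    "filename": "0000.00000v1.Sample.pdf",
    "page_number": 3,
    "languages": ["eng"],            # list: dropped
    "coordinates": {"points": []},   # nested dict: dropped
}

simple_metadata = keep_simple_metadata(sample_metadata)
print(simple_metadata)
```

The `filename` field survives this kind of filtering, which matters here because it is what the links back to arXiv are built from.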

## Setting Up RAG with LangChain

Llama-3-8B-Instruct is a gated model, so you need to be authenticated to download it. Provide your Hugging Face token, or pick an alternative model for text generation.

In [None]:
from huggingface_hub.hf_api import HfFolder

# Add your Hugging Face token here
HfFolder.save_token('')

In [None]:
from langchain.prompts import PromptTemplate
from langchain_community.llms import HuggingFacePipeline
from transformers import pipeline
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from langchain.chains import RetrievalQA

In [None]:
model_name = "meta-llama/Meta-Llama-3-8B-Instruct"

# The quantized version of the model can run on the free T4 provided in Colab.
# Without quantization, you will need a beefier machine.

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_use_double_quant=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained(model_name)

terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

text_generation_pipeline = pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    temperature=0.2,
    do_sample=True,
    repetition_penalty=1.1,
    return_full_text=False,
    max_new_tokens=200,
    eos_token_id=terminators,
)

llm = HuggingFacePipeline(pipeline=text_generation_pipeline)

prompt_template = """
<|start_header_id|>user<|end_header_id|>
You are an assistant for answering questions using provided context.
You are given the extracted parts of a long document and a question. Provide a conversational answer.
If you don't know the answer, just say "I do not know." Don't make up an answer.
Question: {question}
Context: {context}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""

prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=prompt_template,
)


qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=retriever,
    # Set return_source_documents to True to include the retrieved documents in a response
    return_source_documents=True,
    chain_type_kwargs={"prompt": prompt}
)
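To see roughly what the model receives, it can help to fill a template by hand. The sketch below assumes that the retrieved chunks are concatenated with blank lines and substituted for `{context}`, which is how a "stuff"-style chain is generally understood to work; `preview_template` is a shortened stand-in for the prompt above, and `build_prompt_preview` is a hypothetical helper.

```python
from typing import List

# Shortened stand-in for the tutorial's prompt template (instructions omitted)
preview_template = (
    "<|start_header_id|>user<|end_header_id|>\n"
    "Question: {question}\n"
    "Context: {context}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n"
)


def build_prompt_preview(question: str, chunks: List[str]) -> str:
    """Join retrieved chunks and substitute them into the template."""
    context = "\n\n".join(chunks)
    return preview_template.format(question=question, context=context)


prompt_preview = build_prompt_preview(
    "What is RAG?",
    ["chunk one", "chunk two"],  # placeholder text standing in for retrieved chunks
)
print(prompt_preview)
```

With `k=6` chunks of up to 512 characters each, the context stays fairly small, which leaves room for the instructions and the generated answer.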

In [None]:
# When partitioning documents, Unstructured enriches the document elements with metadata.
# Here we use this metadata to extract `paper_id` from `filename` and build a link to the paper on arXiv

import re

def response_with_links(question):
  sources = []
  response = qa_chain.invoke(question)
  answer = response['result']
  for source in response['source_documents']:
    match = re.search(r"(\d+\.\d+)", source.metadata['filename'])
    # Skip chunks whose filename does not contain an arXiv ID; otherwise
    # `paper_id` would be undefined or left over from the previous chunk
    if not match:
      continue
    paper_id = match.group(1)
    arxiv_link = f"https://arxiv.org/abs/{paper_id}"
    sources.append(arxiv_link)
  return {"answer": answer, "sources": sources}
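Several retrieved chunks can come from the same paper, so the list of sources may contain repeated links. The sketch below shows one way to extract the ID and deduplicate links while keeping their order. It assumes new-style arXiv IDs (four digits, a dot, then four or five digits) and would not match old-style IDs; the helper names and the sample file names are hypothetical.

```python
import re
from typing import List, Optional

# Assumption: new-style arXiv IDs such as 2403.09040 (version suffix ignored)
ARXIV_ID_PATTERN = re.compile(r"(\d{4}\.\d{4,5})")


def arxiv_link_from_filename(filename: str) -> Optional[str]:
    """Return an arXiv abstract link, or None if no ID is found."""
    match = ARXIV_ID_PATTERN.search(filename)
    return f"https://arxiv.org/abs/{match.group(1)}" if match else None


def unique_links(filenames: List[str]) -> List[str]:
    """Build links for all filenames, dropping misses and duplicates."""
    links: List[str] = []
    for filename in filenames:
        link = arxiv_link_from_filename(filename)
        if link is not None and link not in links:
            links.append(link)
    return links


# Invented file names standing in for the `filename` metadata of retrieved chunks
sample_filenames = [
    "2403.09040v1.Sample_title.pdf",
    "2403.09040v1.Sample_title.pdf",
    "notes.pdf",
    "2405.13576v1.Another_sample.pdf",
]
sample_links = unique_links(sample_filenames)
print(sample_links)
```

The same idea could be folded into `response_with_links` if repeated links are not wanted in the response.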


In [None]:
llm_response = response_with_links("What is a RAG system?")
print(llm_response["answer"])
print("Sources: ")
print(llm_response["sources"])

Based on the provided context, I can tell you that the RAG system refers to a Retrieval-Augmentation-Generation system. According to the text, it's a system that consists of two primary components: Retrieval and Generation. The Retrieval component extracts relevant information from external knowledge sources, involving two main phases - indexing and searching. The Generation component produces the required contents based on the retrieved information. 

The RAG system has been studied extensively, and numerous enhancements have been proposed over time. As shown in Figure 1, the system consists of two core modules: the Retriever and the Generator. The process unfolds as follows: the Retriever searches for relevant information, then feeds the original query and retrieval results into the Generator through a specific augmentation methodology, and finally, the Generator produces the required contents.
Sources: 
['https://arxiv.org/abs/2403.09040', 'https://arxiv.org/abs/2405.13576', 'https: