# Retrieval-Augmented Generation

We now turn to Retrieval-Augmented Generation (RAG), a core component of many LLM applications. RAG consists of using the user query to search a database of information, usually drawn from text documents, and providing the results to the model as context for generating text.

## Components of RAG

- **Encoding Procedure**

To search for the most relevant information, the documents first have to be converted into vectors that encode their semantic content. This is done with a transformer model that turns a text into an n-dimensional vector, with each dimension representing an abstract semantic feature. In our case the vectors have 1536 dimensions, as that is the output size of the `text-embedding-ada-002` embedding model provided by OpenAI.

- **Knowledge Base**

This is the database of information, usually stored as semantically encoded vectors in a **vector store**. It is used to retrieve information relevant to the user query.

- **Retrieval Procedure**

With the documents encoded into vectors and stored, we can perform a **semantic search** to find the most relevant information for each query. We first encode the query the same way we encoded the documents, obtaining an n-dimensional vector with semantic information about the query. The vectors in the vector store that most closely resemble the query vector should also be the closest in semantic content, and therefore the most relevant. This means we can use simple mathematical operations to assess the similarity between the stored vectors and the query vector, and keep the most similar ones. Many applications use the cosine of the angle between the query vector and each stored vector, which is what we will do in this demonstration; other metrics, such as Euclidean distance or the dot product between the vectors, can also be used.
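To make the three metrics concrete, here is a minimal sketch that uses only the standard library. The three-dimensional vectors and the names `doc_a`, `doc_b` and `doc_c` are made up for illustration; real embeddings from the model used in this demonstration would have 1536 dimensions.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (1.0 means same direction)."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean_distance(a, b):
    """Straight-line distance between a and b (0.0 means identical)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Invented toy "embeddings", not real model output.
query = [1.0, 0.0, 1.0]
stored = {
    "doc_a": [2.0, 0.0, 2.0],  # same direction as the query, but longer
    "doc_b": [0.0, 1.0, 0.0],  # orthogonal to the query
    "doc_c": [1.0, 1.0, 0.0],  # partially aligned with the query
}

# Rank the stored vectors from most to least similar to the query.
ranked = sorted(
    stored,
    key=lambda name: cosine_similarity(query, stored[name]),
    reverse=True,
)
print(ranked)
```

Cosine similarity ranks `doc_a` first, then `doc_c`, then `doc_b`. With these particular vectors, `doc_a` and `doc_c` sit at the same Euclidean distance from the query, which shows that the choice of metric can change the ranking.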

## A simple architecture with RAG

We can now add retrieval to the previous app design:

![rag](_assets/rag.png)

This gives the model more specific information to draw on when generating responses, which tends to produce higher-quality answers.

## Limits of RAG

Some details have to be taken into consideration when using RAG with LLMs. First of all, the quality of the generated responses is, at best, dependent on the quality of the texts. If the texts contain poor information or are poorly written, the quality of the responses will deteriorate.

Another issue arises from the limits of the LLMs themselves. They are, after all, neural networks, and they can only process a fixed maximum number of input units at a time. These units are called tokens: the text is converted into a sequence of numbers, each of which represents a group of characters, roughly comparable to morphemes in language. For the gpt-3.5-turbo-1106 model that we have been using, the maximum number of tokens that can be passed into the model is 16,385. Therefore, our relevant text must be short enough to fit within this limit, together with the rest of the message contents and the chat history. [This website](https://www.tokencounter.io/) helps calculate the number of tokens that a given text produces. Some documents might fit wholly within the token limit, but longer ones will not. For these situations, we can split the texts into "chunks", and the semantic search procedure will return the most relevant chunks.
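The budgeting described above amounts to simple arithmetic, sketched below. The function name `remaining_budget`, the example token counts and the 500 tokens reserved for the answer are all assumptions made for illustration; only the 16,385 limit comes from the text.

```python
CONTEXT_LIMIT = 16_385  # token limit quoted above for gpt-3.5-turbo-1106

def remaining_budget(prompt_tokens, history_tokens, context_tokens,
                     reserved_for_answer=500, limit=CONTEXT_LIMIT):
    """Tokens left after everything is counted; a negative value means it will not fit."""
    used = prompt_tokens + history_tokens + context_tokens + reserved_for_answer
    return limit - used

# Hypothetical counts: a short system prompt, some chat history, one retrieved chunk.
print(remaining_budget(prompt_tokens=120, history_tokens=800, context_tokens=8_192))
# The same request with a much longer retrieved text no longer fits.
print(remaining_budget(prompt_tokens=120, history_tokens=800, context_tokens=16_000))
```

The first call leaves 6773 tokens to spare, while the second is 1035 tokens over the limit, which is the situation that chunking is meant to avoid.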

## Applying RAG

Our first objective is to create a database with the information needed to answer our queries. To do this, we follow these steps:

1. Load the documents and turn them into strings
2. Split each document string into smaller chunks
3. Encode the chunks
4. Add the encoded chunks into a vector store
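Before working with real PDFs and the OpenAI API, the four steps above can be sketched end to end with the standard library alone. Everything here is a stand-in: the two "documents" are invented sentences, `split_into_sentences` replaces a real text splitter, `toy_embed` counts words from a hand-picked `vocabulary` instead of calling an embedding model, and the "vector store" is a plain list scored with a dot product.

```python
import re

def split_into_sentences(text):
    """Step 2 (stand-in): treat each sentence as one chunk."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

def toy_embed(text, vocabulary):
    """Step 3 (stand-in): count how often each vocabulary word appears."""
    words = re.findall(r"[a-z]+", text.lower())
    return [words.count(term) for term in vocabulary]

def dot(a, b):
    """Similarity score used for retrieval in this toy example."""
    return sum(x * y for x, y in zip(a, b))

# Step 1 (stand-in): documents already loaded as strings.
documents = [
    "Open source developers respond to career incentives. Peer recognition also matters.",
    "The transformer relies on attention. It dispenses with recurrence and convolution.",
]
vocabulary = ["open", "source", "incentives", "transformer", "attention", "recurrence"]

# Step 4: the "vector store" is a list of chunk records.
store = []
for doc in documents:
    for chunk in split_into_sentences(doc):
        store.append({"text": chunk, "vector": toy_embed(chunk, vocabulary)})

# Retrieval: embed the query the same way and keep the best-scoring chunk.
query_vector = toy_embed("What incentives exist in open source?", vocabulary)
best = max(store, key=lambda record: dot(query_vector, record["vector"]))
print(best["text"])
```

The rest of this section follows the same shape, with `pypdf`, a token-aware splitter, OpenAI embeddings and Chroma taking the place of each stand-in.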

First we load the necessary libraries: `pypdf` for reading the PDFs and turning them into strings, and `chromadb` for creating and managing the vector store.

In [1]:
# !pip install chromadb pypdf
import chromadb
from pypdf import PdfReader

We can turn each PDF into a string through the `PdfReader` class. This class has an iterable attribute called `pages` that contains all the pages in the document. We have to get the text for each page with the `extract_text` method and then join all the texts.

In [2]:
def pdfToString(path):
        loadedPdf = PdfReader(path)
        pdfString = ""
        for page in loadedPdf.pages:
                pdfString += page.extract_text()
        
        return pdfString

We will be using two texts. The first is Lerner and Tirole (2000) on the economics of open source software development. The second is Vaswani et al.'s (2017) seminal work on transformers and on leveraging attention for efficient language processing.

In [3]:
doc1 = pdfToString("documents/Verner and Tirole (2010).pdf")
doc2 = pdfToString("documents/Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser and Polosukhin (2017).pdf")

Now that we have our documents as strings of text, we can split them into chunks, each of a certain length. To measure the length we could just use the built-in `len` function, which counts characters, but it is better to count the tokens themselves, since the model's limit is expressed in tokens. This can be done with the `tiktoken` package and easily integrated into a splitting function with `RecursiveCharacterTextSplitter` from the `langchain` package, so we will load them both.

In [4]:
# !pip install langchain tiktoken
from tiktoken import get_encoding
from langchain.text_splitter import RecursiveCharacterTextSplitter

First we build a function that tokenizes a text and counts the number of tokens.

In [5]:
tokenizer = get_encoding("cl100k_base")
def count_tokens(text):
	return len(tokenizer.encode(text))

print(count_tokens(doc1))
print(count_tokens(doc2))

14972
10236


Counting the tokens shows that each text could fit within the model's limit on its own, although the first would leave little room for the prompt and the response. Strictly speaking, we do not need to split them into chunks. For demonstration purposes, we will still split them into chunks of at most 8192 tokens, which is about half the model's limit.

In [6]:
text_splitter = RecursiveCharacterTextSplitter(
        chunk_size = 8192,
        chunk_overlap = 150,
        length_function = count_tokens
)
texts = text_splitter.create_documents(
        [doc1, doc2], 
        metadatas = [
                {"authors": "Lerner and Tirole", "year": "2000"},
                {"authors": "Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser and Polosukhin", "year": "2017"}
        ]
)
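To make `chunk_size` and `chunk_overlap` concrete, the sketch below cuts a list of tokens into fixed windows that share a few tokens with their neighbours. This is a deliberate simplification and not how `RecursiveCharacterTextSplitter` works internally: that class splits on separators such as paragraph and line breaks and then merges the pieces, so its chunk boundaries will differ. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(tokens, chunk_size, chunk_overlap):
    """Cut a token list into windows of chunk_size that share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks

# Ten stand-in "tokens"; real token ids would come from the tokenizer.
tokens = list(range(10))
for chunk in sliding_window_chunks(tokens, chunk_size=4, chunk_overlap=1):
    print(chunk)
```

Each window starts on the last token of the previous one, so a sentence that straddles a boundary is still seen whole by at least one chunk. That is the purpose of the 150-token overlap used above.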


Chroma runs locally, although cloud-based vector database services such as Pinecone also exist. Chroma keeps its index either in memory, in which case it is ephemeral and disappears when the program ends, or in a directory on disk, in which case it persists for later use.

First, we have to embed the documents. We could do this through the embedding classes provided by the `openai` package, but in this case we will use the embedding function class provided by `chromadb`; both make a call to the same API endpoint, so the results are the same in either case.

In [7]:
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction
from Constants import OPENAI_API_KEY
openai_ef = OpenAIEmbeddingFunction(
        api_key=OPENAI_API_KEY,
        model_name="text-embedding-ada-002"
)

We will supply this function to the collection, so that both our documents and our queries use the same embedding model.

In [8]:
chromaClient = chromadb.PersistentClient()
collection = chromaClient.create_collection(
        name="my_collection",
        metadata={"hnsw:space": "cosine"},
        embedding_function=openai_ef
)

If the collection has already been created, you can open it with the `get_collection` method. You must supply the embedding function again to do so.

In [None]:
# collection = chromaClient.get_collection(
#         name="my_collection",
#         embedding_function=openai_ef
# )

Now we can add the documents. When they are passed to the `add` method, they are tokenized and embedded with the function we provided earlier. We must also supply an id for each document, and we may optionally provide metadata for each of them. Our `texts` object is a list of `Document` objects, each with a `page_content` and a `metadata` attribute. We can iterate over this list to provide the information to be added to our knowledge base.

In [9]:
collection.add(
        documents=[document.page_content for document in texts],
        metadatas=[document.metadata for document in texts],
        ids=[f"id{i+1}" for i in range(len(texts))]
)

Now we can query the database. This returns a dictionary that contains information about the most relevant documents according to the semantic search.
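The values in that dictionary are parallel lists of lists, with one inner list per query text, which is why the result is indexed twice. The sketch below shows one way to unpack it. The helper `unpack_query_result` is hypothetical, and `example_result` is a hand-written stand-in with invented values that assumes the `ids`, `distances`, `metadatas` and `documents` keys are present.

```python
def unpack_query_result(result, query_index=0):
    """Turn the parallel lists for one query into one record per retrieved chunk."""
    return [
        {"id": id_, "distance": distance, "metadata": metadata, "text": text}
        for id_, distance, metadata, text in zip(
            result["ids"][query_index],
            result["distances"][query_index],
            result["metadatas"][query_index],
            result["documents"][query_index],
        )
    ]

# Invented values with the same nesting as a query result for a single query text.
example_result = {
    "ids": [["id1", "id3"]],
    "distances": [[0.18, 0.27]],
    "metadatas": [[{"authors": "Author A", "year": "2000"},
                   {"authors": "Author B", "year": "2017"}]],
    "documents": [["first chunk text", "second chunk text"]],
}

for hit in unpack_query_result(example_result):
    print(hit["id"], hit["distance"], hit["metadata"]["year"])
```

With the cosine space configured for this collection, the distances should be read as "smaller is closer", so the first record is the best match.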

In [10]:
collection.query(query_texts=["what incentives exist in the open source development economy?"], n_results=2)["documents"][0][0]

"NBER WORKING PAPER SERIES\nTHE SIMPLE ECONOMICS OF OPEN SOURCE\nJosh Lerner\nJean Tirole\nWorking Paper  7600\nhttp://www.nber.org/papers/w7600\nNATIONAL BUREAU OF ECONOMIC RESEARCH\n1050 Massachusetts Avenue\nCambridge, MA 02138\nMarch 2000\nThe assistance of the Harvard Business School’s California Research Center, and Chris Darwall in particular,\nwas instrumental in the development of the case studies and is gratefully acknowledged.  We also thank anumber of practitioners—especially Eric Allman, Mike Balma, Brian Behlendorf, Keith Bostic, TimO’Reilly, and Ben Passarelli—for their willingness to generously spend time discussing the open sourcemovement.  Jacques Crémer, Bernard Salanié, and Rob Merges provided helpful comments.  HarvardBusiness School’s Division of Research provided financial support.  The Institut D'Economie Industriellereceives research grants from a number of corporate sponsors, including Microsoft Corporation.  All opinionsand errors, however, remain our own. \n

Now that we have finished preparing our knowledge base, our next goal is to build an application that finds the most relevant document for our query, provides it to the language model, and generates a response with the information provided. The steps are very similar to those of the basic application we created previously, with the added step of searching the index. First, we must create a system prompt that reflects this task.

In [11]:
query_delimiter = "####"
context_delimiter = "++++"

system_prompt = f"""
You are an assistant tasked with responding to a user's query. For this end, you will be supplied a fragment of an academic paper that is relevant to the user's query.

The query will be delimited by {query_delimiter} characters

The academic paper fragment will be delimited by {context_delimiter} characters

Your answer must only rely on information provided by the academic paper fragment provided

If the fragment is not relevant or does not contain the necessary information, make sure to reflect that in your answer
"""

Now we build a function that takes the query, performs a semantic search, adds the most relevant document to the user message content, and makes an API request to generate a response.

In [12]:
from openai import OpenAI
client = OpenAI(api_key=OPENAI_API_KEY)
model = "gpt-3.5-turbo-1106"

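def ragProcedure(query, n_results = 1):
        # Retrieve the n_results chunks most similar to the query
        relevantTexts = collection.query(
                query_texts=[query],
                n_results=n_results
        )['documents'][0]
        # Join every retrieved chunk, so that n_results > 1 actually reaches the model.
        # Keep n_results small enough for the combined text to fit the token limit.
        relevantText = "\n\n".join(relevantTexts)
        userMessageContent = f"{query_delimiter}{query}{query_delimiter}\n\n{context_delimiter}{relevantText}{context_delimiter}"
        messages = [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": userMessageContent}
        ]
        response = client.chat.completions.create(
                messages = messages,
                model = model
        )
        return response.choices[0].message.content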
        


In [13]:
print(ragProcedure("what incentives exist in the open source development economy?"))

The academic paper "The Simple Economics of Open Source" by Josh Lerner and Jean Tirole explores the economics of open source software development. It discusses motivations and incentives for programmers in both open source and closed source environments. The paper suggests that programmers in open source projects may be motivated by career concerns, ego gratification, and signaling incentives, which can include peer recognition, future job opportunities, and shares in commercial open source-based companies. The visibility of performance, the measurability of effort, and the fluidity of the labor market in open source environments may enhance the signaling incentives for programmers.

While the paper provides insights into the motivations and incentives of programmers in the open source development economy, it does not specifically outline a list of incentives existing in the open source development economy. Therefore, it does not directly provide a clear, concise list of incentives. I

In [15]:
print(ragProcedure("how do transformers compare to convolution or recurrence?"))

The academic paper fragment you provided is titled "Attention Is All You Need" by Ashish Vaswani et al. In this paper, the authors introduce the Transformer model, which is based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. The paper discusses how the Transformer model achieves superior performance in machine translation tasks by relying on self-attention to draw global dependencies between input and output, without using sequence-aligned recurrent neural networks (RNNs) or convolution.

The comparison between transformers, convolution, and recurrence is addressed in the paper:
- The authors propose the Transformer model as an alternative to complex recurrent or convolutional neural networks, highlighting its superiority in terms of quality, parallelizability, and training time.
- They indicate that recurrent models factor computation along the symbol positions of the input and output sequences, inherently leading to sequential nature and limiti