# Retrieval-Augmented Generation (RAG)

One of the most common uses of Large Language Models (LLMs) today is putting them over internal documents to make the information in those documents easily discoverable. Because LLMs limit how much text can be passed to them in a single call, you can't simply hand them an entire document collection. Retrieval-augmented generation (RAG) is a technique that works around this limit and lets LLMs work with any number of documents of any size. RAG involves dividing documents into "chunks" of (typically) a few hundred words each and generating an embedding vector for each chunk. To answer a question, you generate an embedding vector from the question, identify the *n* most similar embedding vectors, and provide the corresponding chunks of text to the LLM. To improve results, you can use a *reranker* to determine which chunks are the most relevant. Let's demonstrate with a CSV file containing more than 8,000 reviews of BMWs.
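
The reviews used in this notebook are short enough to serve as chunks on their own, so no splitting is needed below. For longer documents, the chunking step described above might look something like the following sketch. The function name `chunk_words`, the default sizes, and the use of a small overlap between neighboring chunks are illustrative assumptions, not part of the original notebook:

```python
def chunk_words(text, chunk_size=200, overlap=20):
    """Split text into chunks of about chunk_size words that overlap slightly."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError('chunk_size must be positive and overlap smaller than chunk_size')

    words = text.split()
    step = chunk_size - overlap  # How far the window advances each time
    chunks = []

    for start in range(0, len(words), step):
        chunks.append(' '.join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # This chunk already reached the end of the text

    return chunks


# Tiny demonstration with 10 "words", 4 words per chunk, 1 word of overlap
sample = ' '.join(str(n) for n in range(10))
for chunk in chunk_words(sample, chunk_size=4, overlap=1):
    print(chunk)
```

The overlap keeps a sentence that straddles a chunk boundary from being lost to both chunks. Splitting on words is the simplest option; splitting on sentences or paragraphs is a common alternative.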

Start by creating a ChromaDB database to hold the reviews:

In [1]:
import chromadb

client = chromadb.PersistentClient('chroma')
collection = client.create_collection(name='Car_Reviews')

Now read the CSV file, concatenate the vehicle title and review in each row, and insert the resulting text into the database. Embedding vectors are generated automatically by ChromaDB. Note that this code crashes the kernel with some versions of ChromaDB. Crashes can usually be averted by downgrading the Python package named `chroma-hnswlib` to version 0.7.3, for example with `pip install chroma-hnswlib==0.7.3`.

In [2]:
import re
import pandas as pd

BATCH_SIZE = 100
documents = []
ids = []

df = pd.read_csv('Data/bmw.csv', engine='python')

for i, row in df.iterrows():
    vehicle = row['Vehicle_Title']
    review = row['Review']

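    # Skip rows with missing values (pandas reads them as NaN, which is not a string)
    if isinstance(vehicle, str) and len(vehicle) > 8 and isinstance(review, str) and len(review) > 128: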
        review = re.sub(r'[\r\n\t]+', ' ', review)
        review = re.sub(r'\s{2,}', ' ', review)
        review = review.strip()
        
        text = f'Model: {vehicle}\nReview: {review}'
        documents.append(text)
        ids.append(f'{i:05}')
        
        # Flush the batch to the database once it is full
        if len(documents) >= BATCH_SIZE:
            collection.add(documents=documents, ids=ids)
            documents = []
            ids = []

if len(documents) > 0:
    collection.add(documents=documents, ids=ids)
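
An alternative way to organize the batching, offered here as a sketch and not as the notebook's method, is to collect all of the cleaned documents first and then slice them into fixed-size batches. The helper name `batched` and the placeholder documents are assumptions made for illustration; the call to `collection.add` is left as a comment so the example runs without a database:

```python
def batched(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Hypothetical stand-ins for the cleaned review texts and their IDs
documents = [f'Review text {n}' for n in range(250)]
ids = [f'{n:05}' for n in range(250)]

batch_sizes = []
for doc_batch, id_batch in zip(batched(documents, 100), batched(ids, 100)):
    # With a real collection you would call:
    # collection.add(documents=doc_batch, ids=id_batch)
    batch_sizes.append(len(doc_batch))

print(batch_sizes)  # [100, 100, 50]
```

Every batch except possibly the last has exactly `batch_size` items, which makes the memory and request size of each insert predictable.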

Show the first 5 items added to the database:

In [3]:
items = collection.get()

for document in items['documents'][:5]:
    print(document)
    print('-' * 40)

Model: 2014 BMW X1 SUV xDrive28i 4dr SUV AWD (2.0L 4cyl Turbo 8A)
Review: We picked this car as a CPO from a BMW dealer. No problems to date. Relatively decent handling for the type car, and just the right size and height. We stayed away from the newer version to avoid a Mini based front wheel platform derived car as opposed to a BMW based one. The new 2018 X1 also features a flimsy cloth covering under the moonroof that I can see ripping quickly. If you want a X1 like this may I suggest springing for the upgraded interior. Base level seating has poor lateral support and are unacceptable. Base halogen headlights are unacceptable and should be avoided. My 3 series coupe has xenons which are "light years" better. Sorry!:-) Base sound system is "media-ocre" at best. I don't understand the need for moonroofs on BMWs, and wish they would apportion the monies elsewhere and delete them. Acceleration with the 240 hp is considerably better than you would expect for a 4 cyl. It feels like a 6! B

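Query the database and show the top 5 items that are likely to contain an answer to a question. The scores shown are the distances between embedding vectors, so lower values indicate closer matches. Because the results are reversed before printing, the closest match appears last: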

In [4]:
results = collection.query(
    query_texts=['How reliable are BMWs? Are they expensive to work on?'],
    n_results=5
)

documents = list(reversed(results['documents'][0]))
scores = list(reversed(results['distances'][0]))

for index, document in enumerate(documents):
    print(f'Score: {scores[index]}')
    print(document)
    print('-' * 40)

Score: 0.807755708694458
Model: 2000 BMW 3 Series Sedan 328i 4dr Sedan
Review: Purchased used with 33,000 miles. This is my second, and how disappointed I am. Way too many problems, with everything from the windows to the braking system, to the engine and etc. 14 trips to the dealer for repairs since 10/2003. I would never buy another, and am looking into an Audi or Mercedes. I do have to admit the the MPG is awesome, at times I have clocked 40+. BMW is not what it used to be!! BMW really needs to work on the reliability. This car is way too expensive to maintain!
----------------------------------------
Score: 0.7931195497512817
Model: 2004 BMW M3 Coupe 2dr Coupe (3.2L 6cyl 6M)
Review: I have been driving my M3 for the last 4 years. Truly amazing car. So responsive with exceptional handling. It is turning like on a rail I am not kidding. SMG is amazing once you get used to it - takes a while but once mastered you will love it. So much fun driving in the city. I am getting around 18/25
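
To make the meaning of these scores concrete, here is a minimal sketch of nearest-neighbor ranking by distance. It assumes the collection uses squared Euclidean (L2) distance, which is ChromaDB's default as far as we know, and it uses made-up three-dimensional vectors in place of real embeddings. The names `squared_l2` and `nearest` are hypothetical:

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(query, vectors, n):
    """Return (distance, key) pairs for the n vectors closest to query, closest first."""
    scored = sorted((squared_l2(query, vector), key) for key, vector in vectors.items())
    return scored[:n]


# Made-up "embeddings" for three chunks (real embeddings have hundreds of dimensions)
vectors = {
    'reliability': [0.9, 0.1, 0.0],
    'handling': [0.1, 0.9, 0.0],
    'repair costs': [0.8, 0.2, 0.1],
}
query = [1.0, 0.0, 0.0]

for distance, key in nearest(query, vectors, n=2):
    print(f'{key}: {distance:.2f}')
```

The smallest distance comes first, which is the same order in which `collection.query` returns its results before the notebook reverses them.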

## Use an LLM to answer questions

The next step is to use the database to retrieve relevant chunks and pass them to an LLM for question answering. Define a function that accepts a question, retrieves the 20 most relevant chunks from the database, and passes the question and the chunks to `GPT-4o`:

In [5]:
from openai import OpenAI

def answer_question(question):
    # Retrieve relevant chunks from the database
    client = chromadb.PersistentClient('chroma')
    collection = client.get_collection(name='Car_Reviews')    

    results = collection.query(
        query_texts=[question],
        n_results=20
    )

    # Concatenate the chunks
    documents = results['documents'][0]
    context = '\n\n'.join(documents)

    # Submit the question and the chunks to an LLM and stream the response
    client = OpenAI(api_key='OPENAI_API_KEY')  # Replace with your own API key

    content = f'''
        Answer the following question using the provided context, and if the
        answer is not contained within the context, say "I don't know." Explain
        your answer if possible. Do not mention the provided context in your
        output. Do not use markdown formatting.
        
        Question:
        {question}

        Context:
        {context}
        '''

    messages = [{ 'role': 'user', 'content': content }]

    response = client.chat.completions.create(
        model='gpt-4o',
        messages=messages,
        stream=True
    )

    for chunk in response:
        content = chunk.choices[0].delta.content
        if content is not None:
            print(content, end='')    

Ask the LLM a question about BMWs:

In [6]:
answer_question('How reliable are BMWs? Are they expensive to work on?')

The reliability of BMWs seems to vary significantly across different models and individual experiences. Some owners report high reliability and minimal issues, while others experience frequent mechanical problems and costly repairs. Generally, maintenance costs for BMWs are considered high, and parts can be expensive. Therefore, while some BMWs are reliable for their owners, they are overall expensive to maintain and repair.

Ask another question:

In [7]:
answer_question('What are three good reasons to buy a BMW? What are three good reasons NOT to buy a BMW?')

I don't know.

Make sure the LLM will answer "I don't know" to a question that has nothing to do with BMWs:

In [8]:
answer_question('Why is the sky blue?')

I don't know.
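
The prompt in `answer_question` is built inside an indented triple-quoted string, so the indentation becomes part of the text sent to the LLM. That is usually harmless, but the prompt construction can also be pulled into a small helper that produces unindented text and is easy to inspect. The sketch below is one way to do that. The names `INSTRUCTIONS` and `build_prompt` are illustrative, and reading the key from an `OPENAI_API_KEY` environment variable is an assumption about how you store your key:

```python
import os

INSTRUCTIONS = (
    'Answer the following question using the provided context, and if the '
    'answer is not contained within the context, say "I don\'t know." Explain '
    'your answer if possible. Do not mention the provided context in your '
    'output. Do not use markdown formatting.'
)


def build_prompt(question, chunks):
    """Combine the instructions, the question, and the retrieved chunks."""
    context = '\n\n'.join(chunks)
    return f'{INSTRUCTIONS}\n\nQuestion:\n{question}\n\nContext:\n{context}'


# Read the key from the environment instead of hardcoding it (None if unset)
api_key = os.environ.get('OPENAI_API_KEY')

prompt = build_prompt(
    'How reliable are BMWs?',
    ['Model: A\nReview: Very reliable.', 'Model: B\nReview: Costly repairs.'],
)
print(prompt)
```

The returned string can be used as the `content` of the user message in place of the inline f-string.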

## Add reranking

Using the distance between embedding vectors to determine relevance isn't perfect. It's frequently helpful to retrieve candidate chunks from the database and then use a *reranker* to determine which chunks really are the most relevant. Rerankers come in many forms. The most commonly used rerankers are *cross encoders*, which are language models trained to quantify the relevance between two text samples. Let's load the `jina-reranker-v1-turbo-en` cross encoder from [Hugging Face](https://huggingface.co/jinaai/jina-reranker-v1-turbo-en) so we can use it for reranking:

In [9]:
from sentence_transformers import CrossEncoder

model = CrossEncoder('jinaai/jina-reranker-v1-turbo-en', trust_remote_code=True)

Retrieve 10 chunks from the database to answer a question, and then use the reranker to pick the top 5. How do these chunks compare to the 5 selected earlier?

In [10]:
question = 'How reliable are BMWs? Are they expensive to work on?'

results = collection.query(
    query_texts=[question],
    n_results=10
)

documents = results['documents'][0]
ranked_documents = model.rank(question, documents, return_documents=True, top_k=5)

for document in ranked_documents:
    print(f"Score: {document['score']}")
    print(document['text'])
    print('-' * 40)

Score: 0.31681326031684875
Model: 2004 BMW 3 Series Sedan 330i Rwd 4dr Sedan (3.0L 6cyl 6M)
Review: I've had about 10 different cars throughout my lifespan of only 29 years! I have owned three BMWs VWs a Toyota, Mazda and have driven many other vehicles. This is by far the BEST vehicle ever built. Way to go BMW. I had the 330i fully stocked with navigation, cold weather, premium and 6-speed manual. It was a rare find and I was stupid enough to trade it in and regret it everyday. The engine was silky smooth and soo powerful. IT handled like butter and could turn on a dime. The car felt like you were driving on roller coaster tracks everywhere you went. The gas mileage was excellent. These cars will also you a lifetime I've seen engines go for 300K miles. The inline 6 is so reliable and fun!
----------------------------------------
Score: 0.3021884560585022
Model: 2001 BMW 7 Series Sedan 740iL 4dr Sedan (4.4L 8cyl 5A)
Review: This is a great car! Especially powerful and classy. However, 
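
The retrieve-then-rerank pattern itself doesn't depend on any particular model. The sketch below shows the shape of the computation with a deliberately crude stand-in scorer based on word overlap, so it runs without downloading anything. The names `overlap_score` and `rerank` are hypothetical, and a real cross encoder produces far better relevance scores than this toy function:

```python
def overlap_score(question, document):
    """Toy relevance score: fraction of question words found in the document."""
    question_words = set(question.lower().split())
    document_words = set(document.lower().split())
    if not question_words:
        return 0.0
    return len(question_words & document_words) / len(question_words)


def rerank(question, documents, score_fn, top_k):
    """Score every candidate against the question and keep the top_k best."""
    scored = [{'text': d, 'score': score_fn(question, d)} for d in documents]
    scored.sort(key=lambda item: item['score'], reverse=True)
    return scored[:top_k]


# Candidates as they might come back from a vector search
candidates = [
    'The handling is superb',
    'Repairs are expensive and frequent',
    'Reliable engine but expensive repairs',
]

for item in rerank('expensive repairs reliable', candidates, overlap_score, top_k=2):
    print(f"{item['score']:.2f}  {item['text']}")
```

The result uses the same `text` and `score` keys as the dictionaries printed in the cell above, so swapping the toy scorer for a real reranker changes the quality of the scores but not the surrounding code.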

Rewrite the `answer_question` function to retrieve 40 chunks from the database and use the cross encoder to pick the best 20:

In [11]:
def answer_question(question):
    # Retrieve 40 relevant chunks from the database
    client = chromadb.PersistentClient('chroma')
    collection = client.get_collection(name='Car_Reviews')    

    results = collection.query(
        query_texts=[question],
        n_results=40
    )

    # Use the cross encoder to find the 20 most relevant chunks
    documents = results['documents'][0]
    ranked_documents = model.rank(question, documents, return_documents=True, top_k=20)
    context = '\n\n'.join(x['text'] for x in ranked_documents)

    # Submit the question and the chunks to an LLM and stream the response
    client = OpenAI(api_key='OPENAI_API_KEY')  # Replace with your own API key

    content = f'''
        Answer the following question using the provided context, and if the
        answer is not contained within the context, say "I don't know." Explain
        your answer if possible. Do not mention the provided context in your
        output. Do not use markdown formatting.
        
        Question:
        {question}

        Context:
        {context}
        '''

    messages = [{ 'role': 'user', 'content': content }]

    response = client.chat.completions.create(
        model='gpt-4o',
        messages=messages,
        stream=True
    )

    for chunk in response:
        content = chunk.choices[0].delta.content
        if content is not None:
            print(content, end='')   

Use the modified function to answer a question about BMWs:

In [12]:
answer_question('How reliable are BMWs? Are they expensive to work on?')

BMWs have a mixed reputation for reliability. Some owners report their BMWs as reliable and long-lasting, whereas others express dissatisfaction with frequent mechanical issues and time-consuming repairs. There seems to be a consensus that while BMWs offer excellent driving dynamics and build quality, they can be expensive to maintain and repair. Common issues noted include high costs of parts and maintenance, electronic component failures, and expensive repair bills, especially as the vehicles age. Hence, BMWs are generally considered to be expensive to work on.

Try again with a question that the LLM wasn't able to answer earlier:

In [13]:
answer_question('What are three good reasons to buy a BMW? What are three good reasons NOT to buy a BMW?')

Three good reasons to buy a BMW include the enjoyable driving experience, high-quality engineering, and luxurious interior features. Many reviewers describe BMWs as fun to drive, with excellent handling and performance characteristics, which align with BMW's reputation as the "ultimate driving machine." Additionally, the build quality and craftsmanship are often praised, indicating a high standard of engineering. Lastly, the interior design and comfort frequently receive positive comments, making BMWs a desirable choice for those seeking luxury features.

On the other hand, three good reasons not to buy a BMW include high maintenance costs, potential reliability issues, and poor resale value. Multiple reviewers mention the expensive upkeep associated with owning a BMW, with repair costs often being significantly higher than those for other brands. Some reviews also highlight reliability concerns, noting frequent repairs and mechanical issues. Finally, several owners express disappointm

Make sure the LLM will still admit it doesn't know the answer when the chunks provided to it don't contain the answer to a question:

In [14]:
answer_question('Why is the sky blue?')

I don't know.