# Retrieval Augmented Generation (RAG)

LLMs do not always return a factual response to a prompt. This is especially true if the prompt asks about something the model may not know. For instance, ChatGPT was trained on data until [September 2021](https://help.openai.com/en/articles/6825453-chatgpt-release-notes). If we ask it about events after that date, it might respond that it has no knowledge of the event, or it might flat out hallucinate something.

A simple way to handle this issue is to provide the model with relevant context, as seen in the examples below. Retrieval Augmented Generation, or RAG, can be broken down into three steps: retrieval, augmentation, and generation. We first *retrieve* relevant data based on the prompt. This data can be stored locally, in memory, in a database, etc. We can employ different search algorithms, such as keyword search, semantic search, or hybrid search, to find the relevant data. Then we *augment* our prompt by appending this data as context. Finally, we send the completed prompt to the LLM to *generate* an answer.
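
The three steps can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions, not the approach used later in this notebook: the two-entry `DOCS` list, the word-overlap `retrieve` function, and the `fake_llm` stand-in are all made up for the example. A real system would search a proper document store and call an actual model.

```python
import re

# Hypothetical in-memory "document store", for illustration only.
DOCS = [
    "Argentina won the 2022 FIFA World Cup final against France.",
    "Poutine is a dish of french fries and cheese curds topped with gravy.",
]


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, docs):
    """Step 1: return the document sharing the most words with the question."""
    question_words = tokenize(question)
    return max(docs, key=lambda doc: len(question_words & tokenize(doc)))


def augment(question, context):
    """Step 2: append the retrieved text to the prompt as context."""
    return f"{question}\nContext:\n{context}"


def generate(prompt, llm):
    """Step 3: send the completed prompt to the model."""
    return llm(prompt)


def fake_llm(prompt):
    """Stand-in for a real LLM call; it only reports the prompt size."""
    return f"(model output for a {len(prompt)}-character prompt)"


question = "Who won the 2022 FIFA World Cup?"
context = retrieve(question, DOCS)
prompt = augment(question, context)
answer = generate(prompt, fake_llm)
print(prompt)
print(answer)
```

Keeping the three steps as separate functions makes it easy to swap the naive word-overlap retriever for a semantic search without touching the rest of the pipeline.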

You can read more about this area in the [Georgian AI Library](https://github.com/georgian-io/GAL/blob/main/Prompting%20Tools%20%26%20Techniques/information_retrieval.md).

In [1]:
# Load environment variables
from dotenv import load_dotenv

load_dotenv("../../.env")

True

In [2]:
from tools import llm_call

## A Failure Case

The examples below show how models respond to a question about an event that may be missing from their training data.

In [4]:
prompt = """Who won the 2022 FIFA World Cup?"""

print("----------PROMPT----------")
print(prompt)
print("---------RESPONSE---------")
result = llm_call(model="gpt-4", prompt=prompt, parameters={"temperature": 0.0})
print(result)

----------PROMPT----------
Who won the 2022 FIFA World Cup?
---------RESPONSE---------
As of my last update in October 2021, the 2022 FIFA World Cup has not yet taken place. It is scheduled to be held in Qatar in November and December 2022.


In [7]:
prompt = """Who won the 2022 FIFA World Cup?"""

print("----------PROMPT----------")
print(prompt)
print("---------RESPONSE---------")
result = llm_call(model="chat-bison", prompt=prompt, parameters={"temperature": 0.0})
print(result)

----------PROMPT----------
Who won the 2022 FIFA World Cup?
---------RESPONSE---------
France


GPT-4, as it states above, had training data until October 2021 and declines to answer. PaLM (chat-bison), which uses training data until [Feb 2023](https://cloud.google.com/vertex-ai/docs/generative-ai/learn/models#foundation_models), gives us the wrong answer! While France was a finalist, Argentina won the World Cup!

## Now With Context

We ask both models the same question again, but this time we include context from [Wikipedia](https://en.wikipedia.org/wiki/2022_FIFA_World_Cup).

In [8]:
context = """
The 2022 FIFA World Cup was the 22nd FIFA World Cup, the world championship for national football teams organized by FIFA. It took place in Qatar from 20 November to 18 December 2022, after the country was awarded the hosting rights in 2010. It was the first World Cup to be held in the Arab world and Muslim world, and the second held entirely in Asia after the 2002 tournament in South Korea and Japan.[A]

This tournament was the last with 32 participating teams, with the number of teams being increased to 48 for the 2026 edition. To avoid the extremes of Qatar's hot climate,[B] the event was held during November and December.[C] It was held over a reduced time frame of 29 days with 64 matches played in eight venues across five cities. Qatar entered the event—their first World Cup—automatically as the host's national team, alongside 31 teams determined by the qualification process.

Argentina were crowned the champions after winning the final against the title holder France 4–2 on penalties following a 3–3 draw after extra time. It was Argentina's third title and their first since 1986, as well being the first nation from outside of Europe to win the tournament since 2002. French player Kylian Mbappé became the first player to score a hat-trick in a World Cup final since Geoff Hurst in the 1966 final and won the Golden Boot as he scored the most goals (eight) during the tournament. Argentine captain Lionel Messi was voted the tournament's best player, winning the Golden Ball. The tournament has been considered exceptionally poetic as the capstone of his career, for some commentators fulfilling a previously unmet criterion to be regarded the greatest player of all time.[4] Teammates Emiliano Martínez and Enzo Fernández won the Golden Glove, awarded to the tournament's best goalkeeper; and the Young Player Award, awarded to the tournament's best young player, respectively. With 172 goals, the tournament set a record for the highest number of goals scored in the 32-team format, with every participating team scoring at least one goal.

The choice to host the World Cup in Qatar attracted significant criticism, with concerns raised over the country's treatment of migrant workers, women and members of the LGBT community, as well as Qatar's climate, lack of a strong football culture, scheduling changes, and allegations of bribery for hosting rights and wider FIFA corruption.[D]
"""

In [9]:
prompt = f"""Who won the 2022 FIFA World Cup?\nContext:\n{context}"""

print("----------PROMPT----------")
print(prompt)
print("---------RESPONSE---------")
result = llm_call(model="gpt-4", prompt=prompt, parameters={"temperature": 0.0})
print(result)

----------PROMPT----------
Who won the 2022 FIFA World Cup?
Context:

The 2022 FIFA World Cup was the 22nd FIFA World Cup, the world championship for national football teams organized by FIFA. It took place in Qatar from 20 November to 18 December 2022, after the country was awarded the hosting rights in 2010. It was the first World Cup to be held in the Arab world and Muslim world, and the second held entirely in Asia after the 2002 tournament in South Korea and Japan.[A]

This tournament was the last with 32 participating teams, with the number of teams being increased to 48 for the 2026 edition. To avoid the extremes of Qatar's hot climate,[B] the event was held during November and December.[C] It was held over a reduced time frame of 29 days with 64 matches played in eight venues across five cities. Qatar entered the event—their first World Cup—automatically as the host's national team, alongside 31 teams determined by the qualification process.

Argentina were crowned the champion

In [10]:
prompt = f"""Who won the 2022 FIFA World Cup?\nContext:\n{context}"""

print("----------PROMPT----------")
print(prompt)
print("---------RESPONSE---------")
result = llm_call(model="chat-bison", prompt=prompt, parameters={"temperature": 0.0})
print(result)

----------PROMPT----------
Who won the 2022 FIFA World Cup?
Context:

The 2022 FIFA World Cup was the 22nd FIFA World Cup, the world championship for national football teams organized by FIFA. It took place in Qatar from 20 November to 18 December 2022, after the country was awarded the hosting rights in 2010. It was the first World Cup to be held in the Arab world and Muslim world, and the second held entirely in Asia after the 2002 tournament in South Korea and Japan.[A]

This tournament was the last with 32 participating teams, with the number of teams being increased to 48 for the 2026 edition. To avoid the extremes of Qatar's hot climate,[B] the event was held during November and December.[C] It was held over a reduced time frame of 29 days with 64 matches played in eight venues across five cities. Qatar entered the event—their first World Cup—automatically as the host's national team, alongside 31 teams determined by the qualification process.

Argentina were crowned the champion

And there we have it! Giving context to the model helps it answer questions for us. We can use this in other scenarios too, such as answering questions from private datasets or internal documentation. We just give the relevant documents and get an answer! Pretty simple, right?

...right?

Well...

## Retrieval

While giving context certainly helps the model, a big challenge is figuring out, in an automated fashion, which context to give it. Using the above example of "Who won the 2022 FIFA World Cup?", we would need a way to identify the correct document among all the available documents. In other words, we have a search problem: given thousands or even millions of documents, how do we retrieve the right one(s)?

In the above case, we need to ensure we retrieve the Wikipedia page for the [2022 FIFA World Cup](https://en.wikipedia.org/wiki/2022_FIFA_World_Cup), not the [2026 FIFA World Cup](https://en.wikipedia.org/wiki/2026_FIFA_World_Cup) or the [2023 Rugby World Cup](https://en.wikipedia.org/wiki/2023_Rugby_World_Cup) or the [2021 FIFA Futsal World Cup](https://en.wikipedia.org/wiki/2021_FIFA_Futsal_World_Cup). The following example illustrates a simple scenario where we use retrieval with [QDrant](https://qdrant.tech/) as a vector database.

For more information on retrieval strategies, you can check out the Georgian AI Library. For a more holistic perspective on a RAG system, alongside engineering considerations such as scale and latency, we'll have a session tomorrow comparing different vector databases! And finally, to learn more about using vector databases in practice, we have a demo session from the folks at QDrant this week.

The Plan:
1. Take a toy dataset and obtain embeddings for each document. 
2. Store the embeddings in the database.
3. Create a prompt to send to the model.
4. Search our embeddings for the document most relevant to the prompt.
5. Add the most relevant document as context to the prompt.
6. Send the final prompt to the model.
7. Get the answer we want!

The rest of this notebook is based on this [demo](https://colab.research.google.com/drive/1Bz8RSVHwnNDaNtDwotfPj0w7AYzsdXZ-) from QDrant.

In [4]:
from qdrant_client import models, QdrantClient
from sentence_transformers import SentenceTransformer
import json

### Data

For our list of documents, we just use a toy dataset consisting of Wikipedia blobs like the one in the above example. In typical use cases, you might have your own set of documents or an existing dataset.

In [5]:
with open("rag_dataset.json", "r") as fp:
    documents = json.load(fp)

We have an associated title and description for each document.

In [7]:
[x["title"] for x in documents]

['2026 FIFA World Cup',
 '2022 FIFA World Cup',
 '2023 Rugby World Cup',
 '2021 FIFA Futsal World Cup',
 'Poutine',
 'Canada',
 'Ice Hockey World Championships',
 'Pizza Margherita',
 'Mozzarella',
 'Pizza']

In [10]:
documents[4]["description"]

'Poutine (Quebec French: [put͡sɪn] ⓘ) is a dish of french fries and cheese curds topped with a brown gravy. It emerged in Quebec, in the late 1950s in the Centre-du-Québec region, though its exact origins are uncertain and there are several competing claims regarding its invention. For many years, it was used by some to mock Quebec society.[1] Poutine later became celebrated as a symbol of Québécois culture and the province of Quebec. It has long been associated with Quebec cuisine, and its rise in prominence has led to its growing popularity throughout the rest of Canada.\n\nAnnual poutine celebrations occur in Montreal, Quebec City, and Drummondville, as well as Toronto, Ottawa, New Hampshire, and Chicago. It has been called "Canada\'s national dish", though some critics believe this labelling represents cultural appropriation of the Québécois or Quebec\'s national identity. Many variations on the original recipe are popular, leading some to suggest that poutine has emerged as a new 

### Embeddings
In math, we can represent a point in 2D space as (X, Y), where the first number is the value in the first dimension and the second number is the value in the second dimension. You can think of an embedding as a point that represents a concept in a multidimensional space (anywhere from 32 dimensions to a whopping 1536 dimensions!). Each dimension captures some aspect of the concept. In a real embedding model these aspects are learned and rarely have a clean interpretation, but it helps to picture them as representing things like colors, shapes, or feelings.
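
To make this concrete, here is a small sketch with hand-written 3-dimensional vectors. Everything in it is an assumption for illustration: the vectors, the meaning given to each dimension, and the `cosine_similarity` helper. Real embeddings come from a model and have far more dimensions. The point is only that related concepts end up pointing in similar directions, which we can measure with cosine similarity.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (close to 1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings"; the three dimensions are invented for this example
# (roughly: how sport-like, how food-like, how geography-like).
toy_embeddings = {
    "football": [0.9, 0.0, 0.1],
    "rugby": [0.8, 0.1, 0.2],
    "pizza": [0.0, 0.9, 0.2],
}

query = toy_embeddings["football"]
for name, vector in toy_embeddings.items():
    print(name, round(cosine_similarity(query, vector), 3))
```

Running this should show "rugby" scoring much closer to "football" than "pizza" does, which is the property a semantic search relies on.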

### Embedding Model
The [Massive Text Embedding Benchmark (MTEB)](https://huggingface.co/spaces/mteb/leaderboard) compares different embedding models. We pick a small [model](https://huggingface.co/thenlper/gte-small) from this list for the demo.

### Vector Database

We aren't going to talk much about vector databases today. That's for future sessions (one tomorrow and a hands-on demo from QDrant later!). For now, we're just storing things in memory using QDrant.

In [11]:
embedding_model = SentenceTransformer('thenlper/gte-small')

In [22]:
# This creates a QDrant instance in memory
database = QdrantClient(":memory:")

In [23]:
# Create a collection to store our documents
database.recreate_collection(
    collection_name="wikipedia",
    vectors_config=models.VectorParams(
        size=embedding_model.get_sentence_embedding_dimension(),
        distance=models.Distance.COSINE,
    )
)

True

In [24]:
# Vectorize our documents and upload to the database
database.upload_records(
    collection_name="wikipedia",
    records=[
        models.Record(
            id=idx,
            vector=embedding_model.encode(doc["description"]),
            payload=doc
        ) for idx, doc in enumerate(documents)
    ]
)

### Searching our DB

Now that everything is pushed to the DB, we can run our search. We'll search with the same query as before.

In [27]:
prompt = """Who won the 2022 FIFA World Cup?"""

In [29]:
hits = database.search(
    collection_name="wikipedia",
    query_vector=embedding_model.encode(prompt),
    limit=3,  # Get the top 3 results
)
for hit in hits:
    print(hit.payload['title'], "score:", hit.score)

2022 FIFA World Cup score: 0.9491587
2026 FIFA World Cup score: 0.8985785
2023 Rugby World Cup score: 0.87594485


In [30]:
prompt += f"""\nContext:\n{hits[0].payload["description"]}"""

print("----------PROMPT----------")
print(prompt)
print("---------RESPONSE---------")
result = llm_call(model="gpt-4", prompt=prompt, parameters={"temperature": 0.0})
print(result)

----------PROMPT----------
Who won the 2022 FIFA World Cup?
Context:
The 2022 FIFA World Cup was the 22nd FIFA World Cup, the world championship for national football teams organized by FIFA. It took place in Qatar from 20 November to 18 December 2022, after the country was awarded the hosting rights in 2010. It was the first World Cup to be held in the Arab world and Muslim world, and the second held entirely in Asia after the 2002 tournament in South Korea and Japan.[A]

This tournament was the last with 32 participating teams, with the number of teams being increased to 48 for the 2026 edition. To avoid the extremes of Qatar's hot climate,[B] the event was held during November and December.[C] It was held over a reduced time frame of 29 days with 64 matches played in eight venues across five cities. Qatar entered the event—their first World Cup—automatically as the host's national team, alongside 31 teams determined by the qualification process.
    
Argentina were crowned the champ

And there we have it! We've searched through a bunch of documents, retrieved the right one, added it as context to the prompt and passed it to the LLM!
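
The retrieve-then-augment step can also be wrapped in a small helper. The sketch below is one possible way to do it and is not part of the original demo: `build_rag_prompt`, `min_score`, and `max_docs` are hypothetical names, and the hits are plain `(score, text)` tuples with placeholder texts so the example runs on its own. Filtering on a minimum score is a design choice to keep weakly related documents out of the prompt, and the threshold used here is arbitrary.

```python
def build_rag_prompt(question, hits, min_score=0.9, max_docs=2):
    """Append the best-scoring retrieved texts to the question as context.

    `hits` is a list of (score, text) pairs, assumed to be sorted best-first.
    """
    selected = [text for score, text in hits if score >= min_score][:max_docs]
    if not selected:
        # Nothing relevant enough was found: send the question unchanged.
        return question
    context = "\n\n".join(selected)
    return f"{question}\nContext:\n{context}"


# Scores copied from the search output above; the texts are placeholders
# standing in for the full document descriptions.
example_hits = [
    (0.9491587, "(description of 2022 FIFA World Cup)"),
    (0.8985785, "(description of 2026 FIFA World Cup)"),
    (0.87594485, "(description of 2023 Rugby World Cup)"),
]

print(build_rag_prompt("Who won the 2022 FIFA World Cup?", example_hits))
```

With the search results from this notebook, the tuples could be built as `[(hit.score, hit.payload["description"]) for hit in hits]`.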

This is just an introductory example to RAG and vector DBs. Stay tuned for more sessions on taking this further!

If you'd like to get started right now, QDrant has a whole bunch of [examples](https://github.com/qdrant/examples/) for you to go through. We suggest starting with this [one](https://github.com/qdrant/examples/blob/master/rag-openai-qdrant/rag-openai-qdrant.ipynb)!