# Getting started with RAG in Python

This notebook aims to give a basic introduction to Retrieval-Augmented Generation (RAG) with GPT-4 in Python. The intent is to give an extremely transparent (if simple) run-through of a basic RAG setup in Python. It is not intended to act as a technical reference for production RAG-based systems. For those, you should consider using a dedicated vector database (e.g. Qdrant, Chroma, Vespa) and dedicated LLM tooling such as LangChain.

Additionally, to use this notebook, you'll need an OpenAI API Key and billing set up for your OpenAI account.

## Prerequisites

To get started, make sure you're using Python 3.10 or greater. Install the following packages:

In [2]:
!pip install transformers openai torch scikit-learn



Next, add your API key:

In [None]:
import os

os.environ["OPENAI_API_KEY"] = "your-api-key-here"

## Concepts

The basic RAG pattern involves two key components: the retriever and the generator. The generator is typically an LLM -- in this case it is going to be GPT-4. The retriever can be any external data store, but is commonly centered on some form of vector database loaded with _embeddings_. However it is set up, the aim is to have the retriever fetch relevant documents (or snippets of documents) from its data store and to use these to _augment_ the input (prompt) for the generator so that it can produce better responses. Sounds simple enough, right?
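
To make the pattern concrete before any real models are involved, here is a minimal sketch of the retrieve, augment, generate loop. Everything in it is a stand-in chosen for illustration: the retriever ranks by shared words rather than embeddings, `generate` is a stub rather than an LLM call, and the names `retrieve`, `augment`, `generate` and `toy_docs` are hypothetical and not used elsewhere in this notebook.

```python
def retrieve(query, docs, k=2):
    """Toy retriever: rank docs by how many query words they share."""
    query_words = set(query.lower().split())
    ranked = sorted(
        docs,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def augment(query, retrieved):
    """Put the retrieved snippets in front of the question."""
    context = "\n".join(retrieved)
    return f"Context:\n{context}\n\nQ: {query}\nA:"


def generate(prompt):
    """Stub generator: a real system would send the prompt to an LLM here."""
    return f"[LLM response to a {len(prompt)}-character prompt]"


toy_docs = [
    "The retriever finds relevant snippets.",
    "The generator writes the final answer.",
    "Bananas are yellow.",
]
question = "What does the retriever do?"
toy_prompt = augment(question, retrieve(question, toy_docs, k=1))
print(toy_prompt)
print(generate(toy_prompt))
```

The rest of the notebook replaces each stub with a real component: embeddings for retrieval, a prompt template for augmentation and a chat model for generation.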


## The data

For this example, you'll use some short snippets about space missions in 2023. These were taken from [Wikipedia](https://en.wikipedia.org/wiki/2023_in_spaceflight) on 29th October 2023 with small modifications. At the time this notebook was written, GPT-4 had access to data up to October 2021. As such all of these events are outside of its parametric memory. Here are the documents:

In [None]:
!pip install datasets
from datasets import load_dataset

# Load datasets
gsm8k = load_dataset("gsm8k","main", split="test[:100]")  # Subset for quick testing
mbpp = load_dataset("mbpp", split="test[:100]")

In [21]:
documents = []  # Initialize an empty list for documents
documents.extend(gsm8k['question'])  # Add questions from gsm8k to the documents list
documents.extend(mbpp['text'])  # Add text from mbpp to the documents list

print(documents[:5])

["Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?", 'A robe takes 2 bolts of blue fiber and half that much white fiber.  How many bolts in total does it take?', 'Josh decides to try flipping a house.  He buys a house for $80,000 and then puts in $50,000 in repairs.  This increased the value of the house by 150%.  How much profit did he make?', 'James decides to run 3 sprints 3 times a week.  He runs 60 meters each sprint.  How many total meters does he run a week?', "Every day, Wendi feeds each of her chickens three cups of mixed chicken feed, containing seeds, mealworms and vegetables to help keep them healthy.  She gives the chickens their feed in three separate meals. In the morning, she gives her flock of chickens 15 cups of feed.  In the afternoon, she

In [3]:
documents = [
  "On 14 April, ESA launched the Jupiter Icy Moons Explorer (JUICE) spacecraft to explore Jupiter and its large ice-covered moons following an eight-year transit.",
  "ISRO launched its third lunar mission Chandrayaan-3 on 14 July 2023 at 9:05 UTC; it consists of lander, rover and a propulsion module, and successfully landed in the south pole region of the Moon on 23 August 2023.",
  "Russian lunar lander Luna 25 was launched on 10 August 2023, 23:10 UTC, atop a Soyuz-2.1b rocket from the Vostochny Cosmodrome, it was the first Russian attempt to land a spacecraft on the Moon since the Soviet lander Luna 24 in 1974, it crashed on the Moon on 19 August after technical glitches.",
  "JAXA launched SLIM (Smart Lander for Investigating Moon) lunar lander (carrying a mini rover) and a space telescope (XRISM) on 6 September.",
  "The OSIRIS-REx mission returned to Earth on 24 September with samples collected from asteroid Bennu.",
  "NASA launched the Psyche spacecraft on 13 October 2023, an orbiter mission that will explore the origin of planetary cores by studying the metallic asteroid 16 Psyche, on a Falcon Heavy launch vehicle."
]

## The retriever

In situations where you intend to use an off-the-shelf LLM (i.e. generator), the retriever is the part of the system you have most control over. A common design for retrievers is to use a pre-trained language model to convert your documents into embeddings, and then to use a vector database to store and query these embeddings during operation.

In this simple example, you'll use a pre-trained embedding model available through the `transformers` library from [Hugging Face](https://huggingface.co). This [pre-trained model](https://huggingface.co/BAAI/bge-base-en) is the English language version of the 'General Embedding' model from the Beijing Academy of Artificial Intelligence (BAAI).

Here's a function that creates embeddings from a set of documents:

In [4]:
import torch
from transformers import AutoTokenizer, AutoModel


def embed_documents(docs, model_name):
  """Embed the provided documents to create a document index"""
  # load the tokenizer and model
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  model = AutoModel.from_pretrained(model_name)

  # encode the docs with the tokenizer
  encoded_docs = tokenizer(
      docs, padding=True, truncation=True,
      return_tensors='pt'
  )

  # generate your output embedding vectors
  with torch.no_grad():
      model_output = model(**encoded_docs)
      doc_embeddings = model_output[0][:, 0]

  # convert to numpy vectors for ease of use
  return doc_embeddings.numpy()

As you can see, there are two main elements here: the `tokenizer` and the `model`. You'll notice both use the same `model_name`. Language models usually come with their own tokenizers, which convert human-readable natural language into a machine-readable format (token IDs) compatible with the model. The model then uses this machine-readable format to generate your embeddings.

You can now generate your document index as:

In [6]:
document_index = embed_documents(documents, model_name="BAAI/bge-base-en")
document_index[0].shape # shape of each vector in the index

(768,)

Note that it can be a good idea to 'chunk' your documents before embedding them and creating your document index. This can help reduce the context length subsequently required for the LLM you use, which in turn can improve inference speed and reduce cost. Additionally, it can help the resultant prompt reference specific facts or segments within a document more easily too, which can improve the quality of responses in some cases.
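
As a rough illustration of chunking, the sketch below splits a text into overlapping windows of words. It is only one of many possible strategies and is not part of the original notebook: the helper name `chunk_text` and the default `chunk_size` and `overlap` values are assumptions chosen for the example, and real systems often chunk by sentences, paragraphs or tokens instead.

```python
def chunk_text(text, chunk_size=50, overlap=10):
    """Split text into chunks of `chunk_size` words.

    Consecutive chunks share `overlap` words, so a fact that straddles
    a chunk boundary still appears whole in at least one chunk.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


print(chunk_text("a b c d e f g h i j", chunk_size=4, overlap=1))
# ['a b c d', 'd e f g', 'g h i j']
```

Each chunk would then be embedded as its own entry in the document index, so the list passed to `embed_documents` would contain chunks rather than whole documents.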

Now that you have your document index, you can create a simple retrieval function to search the index for matching documents:

In [7]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def retrieve_documents(query_string, doc_index, docs, k=5, doc_model_name="BAAI/bge-base-en"):
  # embed the query string to obtain a query vector; this must use the
  # same model that was used to build the document index
  query_vector = embed_documents(
      [query_string],
      model_name=doc_model_name
  ).reshape(1, -1)

  # use the query vector to score every document against the query
  similarity = cosine_similarity(query_vector, doc_index).flatten()

  # return the top k most similar docs
  # here, argsort returns the indices that would order the similarities
  # from least similar to most similar. The [::-1] slice reverses this
  # to run from most similar to least, and [:k] keeps the top k of these
  return [docs[i] for i in np.argsort(similarity)[::-1][:k]]

In this case you're using [cosine similarity](https://developers.google.com/machine-learning/clustering/similarity/measuring-similarity), a common method for comparing embedding vectors. The approach implemented here computes the similarity of your query vector to _every document vector_ in the index. When you have lots of documents, this can be extremely expensive. It is in these circumstances that tools that provide _approximate_ search over the document index are useful. You can check out [FAISS](https://faiss.ai/index.html) from Facebook AI Research as an example of a tool that supports efficient similarity search over very large document indexes. For production environments, this is also where vector databases like [Qdrant](https://qdrant.tech/), [Chroma](https://www.trychroma.com/) or [Vespa](https://vespa.ai/) start to come in handy: they manage efficient similarity search for you!
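
To show what the exhaustive search is doing under the hood, here is a dependency-free sketch of cosine similarity and a brute-force top-k lookup. It is illustrative only: the helper names `cosine` and `top_k_exhaustive` are hypothetical, the vectors are made-up two-dimensional toys, and the notebook itself relies on `sklearn` for the real computation.

```python
import heapq
import math


def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_exhaustive(query_vec, doc_vecs, k):
    """Score every document vector (one comparison per document)
    and return the indices of the k highest-scoring ones."""
    scores = [cosine(query_vec, vec) for vec in doc_vecs]
    return heapq.nlargest(k, range(len(doc_vecs)), key=scores.__getitem__)


toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(top_k_exhaustive([1.0, 0.0], toy_vectors, k=2))  # [0, 2]
```

The cost grows linearly with the number of documents, which is the expense that approximate search tools are designed to avoid.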

With all that said, you can then test this simple Retriever (i.e. `retrieve_documents`) using:

In [24]:
example_retrieved_docs = retrieve_documents(
    "Tell me about the Japanese lunar mission.",
    document_index,
    documents,
    k=3
)

example_retrieved_docs

['James decides to run 3 sprints 3 times a week.  He runs 60 meters each sprint.  How many total meters does he run a week?',
 'A robe takes 2 bolts of blue fiber and half that much white fiber.  How many bolts in total does it take?',
 'Josh decides to try flipping a house.  He buys a house for $80,000 and then puts in $50,000 in repairs.  This increased the value of the house by 150%.  How much profit did he make?']

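Clearly there are not many documents in this document index. However, when `documents` still holds the six space-mission snippets that `document_index` was built from, the top document should be the one most relevant to the query text: exactly what you want to see! The output shown above lists GSM8K questions instead, which is what happens when `documents` is reassigned after the index has been built: the returned indices then point into the wrong list.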
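
Retrieval only makes sense when row `i` of the index was computed from `docs[i]`. A small guard can catch the case where the document list has changed since the index was built. This is a sketch, not part of the original notebook: the helper name `check_index_alignment` is hypothetical, and it only detects a length mismatch, not a reordered list of the same length.

```python
def check_index_alignment(doc_index, docs):
    """Raise if the index and the document list have different lengths.

    Works for anything that supports len(), e.g. a 2-D NumPy array
    (where len() is the number of rows) or a list of vectors.
    """
    if len(doc_index) != len(docs):
        raise ValueError(
            f"index has {len(doc_index)} vectors but there are {len(docs)} "
            "documents; rebuild the index after changing the document list"
        )


# passes silently: two vectors, two documents
check_index_alignment([[0.1, 0.2], [0.3, 0.4]], ["first doc", "second doc"])
```

Calling `check_index_alignment(document_index, documents)` just before `retrieve_documents` would be one way to use it in this notebook.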

With this done, it is time to use the retrieved documents to create an augmented input for the LLM (i.e. create an augmented prompt). You'll use a very simple prompt in this case (you should think about ways to make it better!). Here's a simple function to achieve this:

In [25]:
def create_augmented_prompt(query_string, docs):
  # concatenate the retrieved docs as context for the LLM
  # you could do other pre-processing here too
  context = "\n".join(docs)
  # define your prompt template
  prompt_template = """Here is some relevant information:
  {context}

  Q: {query}
  A:
  """
  # render the prompt template
  return prompt_template.format(context=context, query=query_string)

And you can see how this behaves with the following:

In [26]:
example_augmented_prompt = create_augmented_prompt(
    "Tell me about the Japanese lunar mission.",
    example_retrieved_docs
)
example_augmented_prompt

'Here is some relevant information:\n  James decides to run 3 sprints 3 times a week.  He runs 60 meters each sprint.  How many total meters does he run a week?\nA robe takes 2 bolts of blue fiber and half that much white fiber.  How many bolts in total does it take?\nJosh decides to try flipping a house.  He buys a house for $80,000 and then puts in $50,000 in repairs.  This increased the value of the house by 150%.  How much profit did he make?\n\n  Q: Tell me about the Japanese lunar mission.\n  A:\n  '
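
The template above simply concatenates the snippets. One possible refinement, sketched here as an assumption rather than as the notebook's method, is to number each snippet and ask the model to cite the numbers it relied on. The function name `create_cited_prompt` and the instruction wording are illustrative choices.

```python
def create_cited_prompt(query_string, docs):
    """Hypothetical variant of create_augmented_prompt that numbers
    each snippet so the model can cite its sources."""
    numbered = "\n".join(
        f"[{i}] {doc}" for i, doc in enumerate(docs, start=1)
    )
    return (
        "Answer the question using only the numbered sources below. "
        "Cite the sources you use like [1]. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{numbered}\n\n"
        f"Q: {query_string}\nA:"
    )


print(create_cited_prompt("Which agency launched SLIM?", ["snippet A", "snippet B"]))
```

Asking the model to admit when the sources do not contain the answer can also make it easier to spot cases where retrieval returned irrelevant documents.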

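The last important piece is querying the LLM itself. This is simple enough. First, create a client with the `openai` library (pointed here at DeepSeek's OpenAI-compatible endpoint) and send a test query; then wrap the call in another simple function: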

In [27]:
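import os

from openai import OpenAI

# read the key from the environment (set earlier) instead of hard-coding a
# secret in the notebook; it must be valid for the endpoint given in base_url
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"], base_url="https://api.deepseek.com")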

# Query using DeepSeek
response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    stream=False
)

print(response.choices[0].message.content)

The capital of France is Paris.


In [28]:
def generate_response(query_string, model_name):
  # send the query as a user message to the requested chat model,
  # reusing the `client` created in the previous cell
  response = client.chat.completions.create(
      model=model_name,
      messages=[
          {"role": "user", "content": query_string},
      ],
      stream=False
  )
  return response.choices[0].message.content

Once again, you can see how this behaves with:

In [13]:
generate_response("Hello, world!", model_name="deepseek-chat")

'Hello! How can I assist you today?'

Okay, you've now seen all of the core components of using RAG to query an LLM. Time to bring it all together! 🚀

In [14]:
def generate_rag_response(
    query_string,
    docs,
    doc_index,
    model_name="deepseek-chat",
    k=3
):

  # R: retrieve the k most similar documents from the list passed in
  retrieved_docs = retrieve_documents(
      query_string, doc_index, docs, k=k
  )
  # A: create augmented prompt
  augmented_prompt = create_augmented_prompt(query_string, retrieved_docs)

  # G: generate response!
  generated_response = generate_response(augmented_prompt, model_name)
  return generated_response

Now generate a RAG response with:

In [22]:
generate_rag_response("Tell me about the status of the latest Indian lunar mission.", documents, document_index, model_name="deepseek-chat")

"The latest Indian lunar mission is called Chandrayaan-3. It is the successor to the Chandrayaan-2 mission, which had a failed landing attempt on the Moon's surface in September 2019. Chandrayaan-3 aims to achieve a successful soft landing on the Moon and conduct scientific experiments.\n\nAs of the latest updates, Chandrayaan-3 is in the planning and development stages. The Indian Space Research Organisation (ISRO) has not yet announced a specific launch date for the mission. However, it is expected to be launched in the near future, with the primary goal of demonstrating India's capability to perform a soft landing on the lunar surface and explore the Moon's South Pole region."

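With the space-mission snippets as context, the response should be up to date and accurate. The response shown above is not, most likely because the retrieved context did not contain those snippets. Let's compare the RAG response to a 'raw' response from the same model without retrieval: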

In [23]:
generate_response("Tell me about the status of the latest Indian lunar mission.", model_name="deepseek-chat")

"As of my last update in October 2023, the latest Indian lunar mission is **Chandrayaan-3**. This mission is a follow-up to the Chandrayaan-2 mission, which had a partial success in 2019.\n\n### Chandrayaan-3 Mission Overview:\n- **Launch Date**: July 14, 2023\n- **Mission Objectives**: The primary objective of Chandrayaan-3 is to demonstrate soft landing on the Moon's surface and conduct scientific experiments.\n- **Mission Components**:\n  - **Lander (Vikram)**: Named after Dr. Vikram Sarabhai, the father of the Indian space program, the lander is designed to land on the Moon's surface.\n  - **Rover (Pragyan)**: The rover is a six-wheeled robotic vehicle that will conduct scientific experiments on the lunar surface.\n  - **Propulsion Module**: This module carried the lander and rover to the Moon's orbit and then separated from them before the landing.\n\n### Mission Status:\n- **Lunar Orbit Insertion**: The mission successfully inserted the lander and rover into the Moon's orbit on A

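Without retrieved context, the model can only rely on what it learned during training (up to October 2023, according to this response), so anything more recent is out of its reach.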

## Next steps

This is clearly a very simple example. A production system inevitably comes with a bit more complexity. For one thing, using a good vector database will likely help make your RAG setup faster, more scalable and more flexible. Here are some pointers on what to consider when taking the next step:

* **Prompt engineering** - The prompt in this notebook is about as basic as it gets. Careful prompt engineering can make it possible for the LLM to directly reference specific retrieved documents, and to add hyperlinks and other interactivity into the response. You can also get the LLM to respond in structured formats like JSON, too. The [Prompting Guide](https://www.promptingguide.ai/techniques/rag) is a great resource to help get started with this.
* **Multi-index retrieval** - In this case, you used a single document index. It is not uncommon to need to use multiple document data stores. Many vector databases make it easy to achieve this.
* **Fine-tuning RAG models** - The [original RAG paper](https://arxiv.org/pdf/2005.11401.pdf) discusses how to train an embedding model (i.e. the model used for retrieval task) alongside the generation model (i.e. the language model) to perform better at certain tasks. OpenAI themselves provide a 'cookbook' notebook outlining how you can [fine tune an LLM to produce better RAG responses](https://cookbook.openai.com/examples/fine-tuned_qa/ft_retrieval_augmented_generation_qdrant).
* **Caching** - Thoughtful use of caching layers around the retriever (and possibly the generator) can improve the latency and cost efficiency of your RAG system.
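
As a small illustration of the caching idea, the sketch below memoises an embedding call with Python's built-in `functools.lru_cache`, so that repeated queries skip the expensive step. It rests on stated assumptions: `slow_embed` is a hypothetical stand-in that fakes an embedding and counts its own calls, and `cached_embed` is an illustrative name; a real system would wrap its actual embedding function and might use an external cache instead.

```python
from functools import lru_cache


def slow_embed(text):
    """Stand-in for an expensive embedding call (not a real model)."""
    slow_embed.calls += 1
    # return a tuple so the cached value cannot be mutated by callers
    return tuple(float(ord(ch)) for ch in text[:4])


slow_embed.calls = 0  # counts how often the expensive step actually runs


@lru_cache(maxsize=1024)
def cached_embed(text):
    """Return the embedding for `text`, computing it at most once."""
    return slow_embed(text)


cached_embed("Tell me about the Japanese lunar mission.")
cached_embed("Tell me about the Japanese lunar mission.")  # served from cache
print(slow_embed.calls)  # 1
```

`lru_cache` keys on the exact query string, so this only helps when identical queries recur; near-duplicate queries would need some form of normalisation first.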

If you're also a bit unfamiliar with embeddings, a great use of your time would be to learn more about how they fit into the world of modern machine learning. There's [a good introduction in the Google ML Developers course](https://developers.google.com/machine-learning/crash-course/embeddings/video-lecture). If you're a visual learner, you may also want to check out the interactive [Embedding Projector](https://projector.tensorflow.org/) to visualise embeddings, too.