# Introduction to Retrieval-Augmented Generation (RAG)

A RAG solution lets an LLM give the right kind of response in different scenarios. A good RAG system should generate a grounded response based on relevant documents, but not every single time. It must also be able to determine whether any of the provided documents are relevant (possibly deciding that none are), and to recognize when it can respond directly without retrieving any documents.

The chatbot we build here can extract relevant information from external documents and produce verifiable, inline citations in its responses.

The diagram below provides an overview of what we’ll build.

<img src="https://github.com/cohere-ai/cohere-developer-experience/blob/main/notebooks/images/llmu/rag/rag-workflow-1.png?raw=1" alt="Workflow">

## Setup

In [47]:
import sys, os
sys.version

'3.13.3 (tags/v3.13.3:6280bb5, Apr  8 2025, 14:47:33) [MSC v.1943 64 bit (AMD64)]'

In [48]:
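# Install python-dotenv if it is not already available
try:
    from dotenv import load_dotenv
except ImportError:
    import subprocess
    # Use the current interpreter's pip so the package lands in this environment
    subprocess.check_call([sys.executable, "-m", "pip", "install", "python-dotenv"])
    from dotenv import load_dotenv

# Read variables from a local .env file into the environment
load_dotenv()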

COHERE_API_KEY = os.getenv("COHERE_API_KEY")

In [None]:
# ! pip install cohere hnswlib unstructured -q # hnswlib for the vector library, unstructured for chunking the documents


In [49]:
import cohere
# import uuid
import hnswlib
from typing import List, Dict
from unstructured.partition.html import partition_html
from unstructured.chunking.title import chunk_by_title

In [None]:
#@title Enable text wrapping in Google Colab

# from IPython.display import HTML, display

# def set_css():
#   display(HTML('''
#   <style>
#     pre {
#         white-space: pre-wrap;
#     }
#   </style>
#   '''))
# get_ipython().events.register('pre_run_cell', set_css)

In [50]:


co = cohere.Client(COHERE_API_KEY) # Get your API key here: https://dashboard.cohere.com/api-keys

## Simple Example

### Define documents

We define the documents used to ground the LLM's response, formatted as a list. In our case, each document consists of two fields: title and text.


In [51]:
documents = [
    {
        "title": "Tall penguins",
        "text": "Emperor penguins are the tallest."},
    {
        "title": "Penguin habitats",
        "text": "Emperor penguins only live in Antarctica."},
    {
        "title": "What are animals?",
        "text": "Animals are different from plants."},
    {
        "title": "AI student",
        "text": "Yusuf is an AI student"},
    {
        "title": "AI student home Univerity name",
        "text": "Yusuf studies at La Sapienza Univeristy"
    },
    {
        "title": "AI student host Univerity location",
        "text": "He is currently doing an Erasmus Programme in Trento Unversity"
    },
    {
        "title": "Location of Home University and Host Univeristy",
        "text": "La Sapienza is a university located in Rome in the Lazio region of Italy. Trento university, instead, is located in Trento in Italy"
    }
]

### Generate response with citations


First, we define the user message. Then we generate the response from the LLM and display it, together with citations and the source documents used.

In [52]:
# Get the user message
message = "Hi? How are doing? What are the tallest living penguins? Who is Yusuf? Where is his host university located?"

# Generate the response
response = co.chat_stream(message=message,
                          model="command-a-03-2025",
                          documents=documents)

# Display the response
citations = []
cited_documents = []

for event in response:
    if event.event_type == "text-generation":
        print(event.text, end="")
    elif event.event_type == "citation-generation":
        citations.extend(event.citations)
    elif event.event_type == "stream-end":
        cited_documents = event.response.documents

# Display the citations and source documents
if citations:
    print("\n\nCITATIONS:")
    for citation in citations:
        print(citation)

    print("\nDOCUMENTS:")
    for document in cited_documents:
        print(document)

I'm doing well, thanks for asking.

The tallest living penguins are Emperor penguins.

Yusuf is an AI student who studies at La Sapienza University. His host university, Trento University, is located in Trento, Italy.

CITATIONS:
start=68 end=85 text='Emperor penguins.' document_ids=['doc_0'] type='TEXT_CONTENT'
start=99 end=109 text='AI student' document_ids=['doc_3'] type='TEXT_CONTENT'
start=125 end=148 text='La Sapienza University.' document_ids=['doc_4'] type='TEXT_CONTENT'
start=170 end=187 text='Trento University' document_ids=['doc_5'] type='TEXT_CONTENT'
start=203 end=217 text='Trento, Italy.' document_ids=['doc_6'] type='TEXT_CONTENT'

DOCUMENTS:
{'id': 'doc_0', 'text': 'Emperor penguins are the tallest.', 'title': 'Tall penguins'}
{'id': 'doc_3', 'text': 'Yusuf is an AI student', 'title': 'AI student'}
{'id': 'doc_4', 'text': 'Yusuf studies at La Sapienza Univeristy', 'title': 'AI student home Univerity name'}
{'id': 'doc_5', 'text': 'He is currently doing an Erasmus Program
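
The `start` and `end` values in each citation are character offsets into the response text, so they can be used to place source markers inline. The sketch below is a minimal illustration under that assumption: `Span` and `annotate` are hypothetical names standing in for the SDK's citation objects, and the sample sentence is taken from the response above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Span:
    """Hypothetical stand-in for a citation: character offsets plus source IDs."""
    start: int
    end: int
    document_ids: List[str]


def annotate(text: str, citations: List[Span]) -> str:
    """Insert a [doc_id] marker immediately after each cited span."""
    # Work from the last span to the first so earlier offsets remain valid.
    for c in sorted(citations, key=lambda c: c.start, reverse=True):
        marker = "[" + ", ".join(c.document_ids) + "]"
        text = text[:c.end] + marker + text[c.end:]
    return text


sentence = "The tallest living penguins are Emperor penguins."
cited = "Emperor penguins."
start = sentence.find(cited)
annotated = annotate(sentence, [Span(start, start + len(cited), ["doc_0"])])
print(annotated)
```

Inserting from the end of the text backwards avoids having to shift the remaining offsets after each marker is added.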

## Level 2

There are three RAG modes available with the Cohere Chat endpoint:

- Document mode: Specifying the documents for the model to use when generating a response
- Connectors mode: Connecting the endpoint with an external service that handles all the logic of document retrieval
- Query-generation mode: Generating one or more queries given a user message

Note that document mode also includes query-generation mode.

![An overview of what we'll build](https://cohere.com/_next/image?url=https%3A%2F%2Fcohere-ai.ghost.io%2Fcontent%2Fimages%2F2024%2F04%2Frag-workflow-2.png&w=2048&q=75)

The steps to building a RAG-powered chatbot are summarized below:

0. Setup phase:
   - Step 0: Ingest the documents – get documents, chunk, embed, and index

1. For each user-chatbot interaction:
   - Step 1: Get the user message
   - Step 2: Call the Chat endpoint in query-generation mode
   - If at least one query is generated:
     - Step 3: Retrieve and rerank relevant documents
     - Step 4: Call the Chat endpoint in document mode to generate a grounded response with citations
   - If no query is generated:
     - Step 4: Call the Chat endpoint in normal mode to generate a response

   - Throughout the conversation:
     - Append the user-chatbot interaction to the conversation thread
     - Repeat with every interaction
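
The per-interaction steps above can be sketched as a small routing function. This is only an outline of the control flow, not the notebook's implementation: `answer` and the three `toy_*` functions are hypothetical stand-ins so the example runs without an API key.

```python
from typing import Callable, Dict, List


def answer(
    message: str,
    generate_queries: Callable[[str], List[str]],
    retrieve: Callable[[str], List[Dict[str, str]]],
    respond: Callable[[str, List[Dict[str, str]]], str],
) -> str:
    """Route one user message through steps 2 to 4."""
    queries = generate_queries(message)      # Step 2: query generation
    documents: List[Dict[str, str]] = []
    for query in queries:                    # Step 3: skipped when no query is generated
        documents.extend(retrieve(query))
    return respond(message, documents)       # Step 4: grounded or direct response


# Toy stand-ins for the real endpoint calls (assumptions for illustration only).
def toy_queries(message: str) -> List[str]:
    return [message] if "?" in message else []


def toy_retrieve(query: str) -> List[Dict[str, str]]:
    return [{"title": "stub", "text": f"result for {query}"}]


def toy_respond(message: str, documents: List[Dict[str, str]]) -> str:
    mode = "document mode" if documents else "normal mode"
    return f"{mode}: {len(documents)} document(s)"


print(answer("Hello", toy_queries, toy_retrieve, toy_respond))
print(answer("What is RAG?", toy_queries, toy_retrieve, toy_respond))
```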

In [20]:
# @title Defining documents

raw_documents = [
    {
        "title": "Crafting Effective Prompts",
        "url": "https://docs.cohere.com/docs/crafting-effective-prompts"},
    {
        "title": "Advanced Prompt Engineering Techniques",
        "url": "https://docs.cohere.com/docs/advanced-prompt-engineering-techniques"},
    {
        "title": "Prompt Truncation",
        "url": "https://docs.cohere.com/docs/prompt-truncation"},
    {
        "title": "Preambles",
        "url": "https://docs.cohere.com/docs/preambles"}
]


![Vector Store](https://cohere.com/_next/image?url=https%3A%2F%2Fcohere-ai.ghost.io%2Fcontent%2Fimages%2F2024%2F04%2Frag-components-vectorstore.png&w=2048&q=75)

![The document ingestion portion of the Documents component](https://cohere.com/_next/image?url=https%3A%2F%2Fcohere-ai.ghost.io%2Fcontent%2Fimages%2F2024%2F03%2Frag-chatbot-embedding.png&w=2048&q=75)

In [40]:
# @title Vectorstore

class Vectorstore:
    """The Vectorstore class handles the ingestion of documents into embeddings (or vectors)
    and the retrieval of relevant documents given a query.
    """
    def __init__(self, raw_documents: List[Dict[str, str]]):
        self.raw_documents = raw_documents
        self.docs = [] # chunked version of the documents
        self.docs_embs = [] # embeddings of the chunked documents
        self.retrieve_top_k = 10
        self.rerank_top_k = 3
        self.load_and_chunk()
        self.embed()
        self.index()

    def load_and_chunk(self) -> None:
        """
        Loads the text from the sources and chunks the HTML content.
        """
        print("Loading documents...")

        for raw_document in self.raw_documents:
            elements = partition_html(url=raw_document["url"])
            chunks = chunk_by_title(elements)
            for chunk in chunks:
                self.docs.append(
                    {
                        "title": raw_document["title"],
                        "text": str(chunk),
                        "url": raw_document["url"],
                    }
                )

    def embed(self) -> None:
        """
        Embeds the document chunks using the Cohere API.

        With the Embed v3 model, we need to define an input_type, of which there are four options depending
        on the type of task. Using these input types ensures the highest possible quality for the respective tasks.
        Since our document chunks will be used for retrieval, we use search_document as the input_type
        """
        print("Embedding document chunks...")

        # Since the endpoint has a limit of 96 documents per call, we send them in batches.
        batch_size = 90
        self.docs_len = len(self.docs)
        for i in range(0, self.docs_len, batch_size):
            batch = self.docs[i : min(i + batch_size, self.docs_len)]
            texts = [item["text"] for item in batch]
            docs_embs_batch = co.embed(
                texts=texts, model="embed-english-v3.0", input_type="search_document"
            ).embeddings
            self.docs_embs.extend(docs_embs_batch)

    def index(self) -> None:
        """
        Indexes the documents for efficient retrieval.

        For production environments, typically a vector database (like Weaviate or MongoDB) is
        required to handle the continuous process of indexing documents and maintaining the index.

        Here, however, we’ll keep it simple and use a vector library instead.
        We can choose from many open-source projects, such as Faiss, Annoy, ScaNN,
        or Hnswlib, which is the one we’ll use.
        These libraries store embeddings in in-memory indexes and implement
        approximate nearest neighbor (ANN) algorithms to make similarity search efficient.
        """
        print("Indexing documents...")

        # ip = inner product for the similarity metric to be used
        self.idx = hnswlib.Index(space="ip", dim=1024)

        # ef_construction=512: Controls the quality and speed of index construction.
        # Higher values lead to better recall at the cost of slower indexing.
        # M=64: Determines the number of bi-directional links created for each element in the HNSW graph.
        # Larger values increase accuracy but also increase memory usage.
        self.idx.init_index(max_elements=self.docs_len, ef_construction=512, M=64)

        # Add the embeddings to the index with their corresponding IDs from (0 to len(docs_embs))
        self.idx.add_items(self.docs_embs, list(range(len(self.docs_embs))))

        print(f"Indexing complete with {self.idx.get_current_count()} documents.")

    def retrieve(self, query: str) -> List[Dict[str, str]]:
        """Retrieves document chunks based on the given query using Semantic Search.
        It has 2 steps: Dense retrieval and Reranking.

        While our dense retrieval component is already highly capable of retrieving relevant sources,
        Cohere Rerank provides an additional boost to the quality of the search results,
        especially for complex and domain-specific queries. It takes the search results and
        sorts them according to their relevance to the query.

        Parameters:
        query (str): The query to retrieve document chunks for.

        Returns:
        List[Dict[str, str]]: A list of dictionaries representing the retrieved document chunks, with 'title', 'text', and 'url' keys.
        """

        # Dense retrieval with input_type="search_query" for queries
        query_emb = co.embed(
            texts=[query], model="embed-english-v3.0", input_type="search_query"
        ).embeddings

        doc_ids = self.idx.knn_query(query_emb, k=self.retrieve_top_k)[0][0]
        print(f"Retrieved document IDs: {doc_ids}")

        # Reranking for additional boost in relevance
        rank_fields = ["title", "text"] # We'll use the title and text fields for reranking

        docs_to_rerank = [self.docs[doc_id] for doc_id in doc_ids]

        rerank_results = co.rerank(
            query=query,
            documents=docs_to_rerank,
            top_n=self.rerank_top_k,
            model="rerank-english-v3.0",
            rank_fields=rank_fields
        )

        doc_ids_reranked = [doc_ids[result.index] for result in rerank_results.results]
        print(f"Rerank results: {rerank_results.results}")

        docs_retrieved = []
        for i, doc_id in enumerate(doc_ids_reranked):
            docs_retrieved.append(
                {
                    "title": self.docs[doc_id]["title"],
                    "text": self.docs[doc_id]["text"],
                    "url": self.docs[doc_id]["url"],
                    "id": int(doc_id),
                    "relevance_score": rerank_results.results[i].relevance_score,
                }
            )

        return docs_retrieved
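
The `knn_query` call in `retrieve` performs an approximate nearest-neighbor search under the inner-product metric. For comparison, the exact version of that search can be sketched in plain Python. `top_k_inner_product` is a hypothetical helper, and the 3-dimensional vectors are made-up data rather than real embeddings.

```python
from typing import List, Sequence


def top_k_inner_product(
    query: Sequence[float],
    vectors: Sequence[Sequence[float]],
    k: int,
) -> List[int]:
    """Return the indices of the k vectors with the largest inner product with query."""
    scores = [sum(q * v for q, v in zip(query, vec)) for vec in vectors]
    ranked = sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 0.0, 1.0],
]
nearest = top_k_inner_product([1.0, 0.2, 0.0], toy_vectors, k=2)
print(nearest)
```

An exact scan like this costs one dot product per stored vector for every query. An HNSW index trades a small amount of recall for much faster lookups on large collections.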


In [41]:
# @title Process the Documents.
# The four web pages are chunked into documents; the exact count depends on the
# pages' current content (120 in the run shown below).
vectorstore = Vectorstore(raw_documents=raw_documents)

Loading documents...
Embedding document chunks...
Indexing documents...
Indexing complete with 120 documents.
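
The `embed` method sends chunks in batches of 90 to stay under the 96-document limit noted in its comments. For the 120 chunks reported above, that works out to two calls, as this small sketch shows. `batch_bounds` is a hypothetical helper, not part of the notebook.

```python
from typing import List, Tuple


def batch_bounds(total: int, batch_size: int) -> List[Tuple[int, int]]:
    """Return (start, end) slice bounds that cover range(total) in batches."""
    return [(i, min(i + batch_size, total)) for i in range(0, total, batch_size)]


bounds = batch_bounds(120, 90)
print(bounds)  # [(0, 90), (90, 120)]
```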


In [42]:
# @title Testing Retrieval
vectorstore.retrieve("Prompting by giving examples")

Retrieved document IDs: [55 64 31  1 43 93 56 42 63 41]
Rerank results: [RerankResponseResultsItem(document=None, index=0, relevance_score=0.99554896), RerankResponseResultsItem(document=None, index=2, relevance_score=0.98835784), RerankResponseResultsItem(document=None, index=6, relevance_score=0.96182173)]


[{'title': 'Advanced Prompt Engineering Techniques',
  'text': 'Few-shot Prompting\n\nUnlike the zero-shot examples above, few-shot prompting is a technique that provides a model with examples of the task being performed before asking the specific question to be answered. We can steer the LLM toward a high-quality solution by providing a few relevant and diverse examples in the prompt. Good examples condition the model to the expected response type and style.',
  'url': 'https://docs.cohere.com/docs/advanced-prompt-engineering-techniques',
  'id': 55,
  'relevance_score': 0.99554896},
 {'title': 'Crafting Effective Prompts',
  'text': 'Incorporating Example Outputs\n\nLLMs respond well when they have specific examples to work from. For example, instead of asking for the salient points of the text and using bullet points “where appropriate”, give an example of what the output should look like.',
  'url': 'https://docs.cohere.com/docs/crafting-effective-prompts',
  'id': 31,
  'relevance
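
Each rerank result's `index` refers to a position in the list that was sent to the reranker, not to a document ID, which is why `retrieve` maps it back through `doc_ids`. Using the values printed above:

```python
# Values copied from the retrieval output above.
doc_ids = [55, 64, 31, 1, 43, 93, 56, 42, 63, 41]  # dense-retrieval order
rerank_indices = [0, 2, 6]                          # positions picked by the reranker

# Map reranker positions back to document IDs.
doc_ids_reranked = [doc_ids[i] for i in rerank_indices]
print(doc_ids_reranked)  # [55, 31, 56]
```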

In [None]:
# @title Run the Chatbot

def run_chatbot(message, chat_history=None):
    """
    1. Give the user message to the LLM to determine if additional context is needed
    2. If so:
       - The LLM returns search queries
       - We retrieve the documents from the vector store
       - The LLM uses the documents as context and responds
    3. If not:
       - The LLM responds directly without additional context
    """
    # Avoid a mutable default argument; start a fresh history when none is given
    if chat_history is None:
        chat_history = []

    # Generate search queries, if any
    response = co.chat(message=message,
                        model="command-a-03-2025",
                        search_queries_only=True, # Generate only search queries, not full responses
                        chat_history=chat_history)

    search_queries = []
    for query in response.search_queries:
        search_queries.append(query.text)

    # If there are search queries, retrieve the documents
    if search_queries:
        print("Retrieving information...", end="")

        # Retrieve document chunks for each query
        documents = []
        for query in search_queries:
            documents.extend(vectorstore.retrieve(query))

        # Use document chunks to respond
        response = co.chat_stream(
            message=message,
            model="command-a-03-2025",
            documents=documents,
            chat_history=chat_history,
        )

    else:
        # If no additional context is needed, respond directly
        response = co.chat_stream(
            message=message,
            model="command-a-03-2025",
            chat_history=chat_history,
        )

    # Print the chatbot response and citations
    chatbot_response = ""
    print("\nChatbot:")

    for event in response:
        if event.event_type == "text-generation":
            print(event.text, end="")
            chatbot_response += event.text
        if event.event_type == "stream-end":
            if event.response.citations:
                print("\n\nCITATIONS:")
                for citation in event.response.citations:
                    print(citation)
            if event.response.documents:
                print("\nCITED DOCUMENTS:")
                for document in event.response.documents:
                    print(document)
            # Update the chat history for the next turn
            chat_history = event.response.chat_history

    return chat_history
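
When several search queries are generated, the same chunk can be retrieved more than once, and `run_chatbot` passes the duplicates along. One possible refinement, offered as a sketch rather than as part of the original notebook, is to deduplicate by the `id` field that `retrieve` attaches to each chunk. `dedupe_by_id` is a hypothetical helper.

```python
from typing import Any, Dict, List


def dedupe_by_id(documents: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep the first occurrence of each document ID, preserving order."""
    seen = set()
    unique = []
    for doc in documents:
        if doc["id"] not in seen:
            seen.add(doc["id"])
            unique.append(doc)
    return unique


docs = [{"id": 55, "title": "a"}, {"id": 31, "title": "b"}, {"id": 55, "title": "a"}]
unique_ids = [d["id"] for d in dedupe_by_id(docs)]
print(unique_ids)  # [55, 31]
```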


In [44]:
# Turn # 1
chat_history = run_chatbot("Hello, I have a question")



Chatbot:
Hello! I'm here to help. Please go ahead and ask your question, and I'll do my best to provide a helpful and informative answer.

In [45]:
# Turn # 2
chat_history = run_chatbot("What is prompt engineering?", chat_history)



Chatbot:
**Prompt engineering** is the practice of designing and optimizing input prompts to guide language models, like GPT, to generate desired outputs. It involves crafting specific, clear, and contextually rich instructions or questions to elicit accurate, relevant, or creative responses from AI systems.

Here’s a breakdown of key aspects of prompt engineering:

1. **Purpose**:  
   - To improve the quality, relevance, and specificity of AI-generated responses.  
   - To align the model's output with the user's intent or task requirements.

2. **Techniques**:  
   - **Clear Instructions**: Providing explicit directions or examples in the prompt.  
   - **Contextual Information**: Adding background details to guide the model.  
   - **Iterative Refinement**: Testing and adjusting prompts to improve results.  
   - **Role Assignment**: Framing the model as a specific "role" (e.g., "Act as a teacher").  
   - **Constraints**: Limiting the scope of the response (e.g., "Answer in 3 sen

In [46]:
# Turn # 3
chat_history = run_chatbot("What's the difference between zero-shot and few-shot prompting", chat_history)



Chatbot:
**Zero-shot** and **few-shot prompting** are techniques used in prompt engineering to guide language models, but they differ in how they leverage examples or context. Here's the breakdown:

---

### **Zero-Shot Prompting**
- **Definition**: Zero-shot prompting involves giving the model a task or question **without providing any examples** of the desired output. The model relies solely on its pre-trained knowledge to generate a response.
- **Example**:  
  Prompt: *"Translate the following English sentence into French: 'The cat is on the mat.'"*  
  The model generates the translation based on its understanding of English and French, without seeing any prior examples.
- **Use Case**: Ideal when the model is expected to generalize well to new tasks or when examples are not available.
- **Advantage**: Simple and requires no additional data.
- **Limitation**: May produce less accurate or inconsistent results for complex or ambiguous tasks.

---

### **Few-Shot Prompting**
- **Def
