# Retrieval Augmented Generation (RAG)

Large Language Models (LLMs) excel at generating text and maintaining conversational context in chat applications. However, LLMs can sometimes hallucinate, producing responses that are factually incorrect. Mitigating this is particularly important in enterprise environments, where organizations work with proprietary information that wasn't part of the model's training data.

Retrieval-augmented generation (RAG) addresses this limitation by enabling LLMs to incorporate external knowledge sources into their response generation process. By grounding responses in retrieved facts, RAG can substantially reduce hallucinations and improve the accuracy and reliability of the model's outputs.

In this tutorial, we'll cover:
- Setting up the Cohere client
- Building a RAG application by combining retrieval and chat capabilities
- Managing chat history and maintaining conversational context
- Handling direct responses vs responses requiring retrieval
- Generating citations for retrieved information

In the next tutorial, we'll explore how to leverage Cohere's tool use features to build agentic applications.

We'll use Cohere's Command, Embed, and Rerank models deployed on Azure.

## Setup

First, you will need to deploy the Command, Embed, and Rerank models on Azure via Azure AI Foundry. Each deployment creates a serverless API with pay-as-you-go, token-based billing. You can find more information on how to deploy models in the [Azure documentation](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/deploy-models-serverless?tabs=azure-ai-studio).

Once the model is deployed, you can access it via Cohere's Python SDK. Let's now install the Cohere SDK and set up our client.

To create a client, you need to provide the API key and the model's base URL for the Azure endpoint. You can get this information from the Azure AI Foundry platform where you deployed the model.

In [1]:
# ! pip install cohere hnswlib unstructured

import cohere

co_chat = cohere.Client(
    api_key="AZURE_API_KEY_CHAT",
    base_url="AZURE_ENDPOINT_CHAT" # example: "https://cohere-command-r-plus-08-2024-xyz.eastus.models.ai.azure.com/"
)

co_embed = cohere.Client(
    api_key="AZURE_API_KEY_EMBED",
    base_url="AZURE_ENDPOINT_EMBED" # example: "https://cohere-embed-v3-multilingual-xyz.eastus.models.ai.azure.com/"
)

co_rerank = cohere.Client(
    api_key="AZURE_API_KEY_RERANK",
    base_url="AZURE_ENDPOINT_RERANK" # example: "https://cohere-rerank-v3-multilingual-xyz.eastus.models.ai.azure.com/"
)
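
Hard-coded keys are easy to leak when a notebook is shared. As a minimal sketch, the key and endpoint for each deployment could be read from environment variables instead. The helper `load_azure_config` and the variable names `AZURE_API_KEY_<SUFFIX>` and `AZURE_ENDPOINT_<SUFFIX>` are assumptions that mirror the placeholders above; they are not part of the Cohere SDK or of Azure.

```python
import os


def load_azure_config(suffix: str) -> dict:
    """Read the API key and endpoint URL for one deployment from the environment.

    The variable names are an assumption that mirrors the placeholders used in
    this tutorial; rename them to match your own setup.
    """
    key_var = f"AZURE_API_KEY_{suffix}"
    url_var = f"AZURE_ENDPOINT_{suffix}"

    # Fail early with a clear message if either variable is unset or empty
    missing = [name for name in (key_var, url_var) if not os.environ.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")

    return {"api_key": os.environ[key_var], "base_url": os.environ[url_var]}


# Usage (requires the cohere package and real credentials):
# co_chat = cohere.Client(**load_azure_config("CHAT"))
```

The returned dictionary uses the same `api_key` and `base_url` keyword names that `cohere.Client` receives in the cell above.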

## A quick example

Let's begin with a simple example to explore how RAG works.
 
The foundation of RAG is having a set of documents for the LLM to reference. Below, we'll work with a small collection of basic documents. While RAG systems usually involve retrieving relevant documents based on the user's query (which we'll explore later), for now we'll keep it simple and use this entire small set of documents as context for the LLM.

We have seen how to use the Chat endpoint in the text generation chapter. To use the RAG feature, we simply need to add one additional parameter, `documents`, to the endpoint call. These are the documents we want to provide as the context for the model to use in its response.

In [8]:
documents = [
    {
        "title": "Tall penguins",
        "text": "Emperor penguins are the tallest."},
    {
        "title": "Penguin habitats",
        "text": "Emperor penguins only live in Antarctica."},
    {
        "title": "What are animals?",
        "text": "Animals are different from plants."}
]

Let's see how the model responds to the question "What are the tallest living penguins?"

The model leverages the provided documents as context for its response. Specifically, when mentioning that Emperor penguins are the tallest species, it references `doc_0` - the document which states that "Emperor penguins are the tallest."

In [9]:
message = "What are the tallest living penguins?"

response = co_chat.chat(
    message=message, 
    documents=documents
)

print("\nRESPONSE:\n")
print(response.text)

if response.citations:
    print("\nCITATIONS:\n")
    for citation in response.citations:
        print(citation)


RESPONSE:

The tallest living penguins are the Emperor penguins. They only live in Antarctica.

CITATIONS:

start=36 end=53 text='Emperor penguins.' document_ids=['doc_0']
start=59 end=83 text='only live in Antarctica.' document_ids=['doc_1']
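
Each citation's `start` and `end` values are character offsets into the response text, so they can be used to mark up the response for display. The sketch below is one way to do that, using plain dictionaries as stand-ins for the citation objects printed above; the helper name `annotate_with_citations` is hypothetical.

```python
def annotate_with_citations(text: str, citations: list) -> str:
    """Insert a [doc_id, ...] marker after each cited span of the response text."""
    annotated = text
    # Work from the last span to the first so that earlier offsets stay valid
    for citation in sorted(citations, key=lambda c: c["end"], reverse=True):
        marker = "[" + ", ".join(citation["document_ids"]) + "]"
        end = citation["end"]
        annotated = annotated[:end] + marker + annotated[end:]
    return annotated


# Values copied from the output above; dicts stand in for the SDK's citation objects
text = "The tallest living penguins are the Emperor penguins. They only live in Antarctica."
citations = [
    {"start": 36, "end": 53, "document_ids": ["doc_0"]},
    {"start": 59, "end": 83, "document_ids": ["doc_1"]},
]

print(annotate_with_citations(text, citations))
# The tallest living penguins are the Emperor penguins.[doc_0] They only live in Antarctica.[doc_1]
```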


## A more comprehensive example

Now that we’ve covered a basic RAG implementation, let’s look at a more comprehensive example of RAG that includes:

- Creating a retrieval system that converts documents into text embeddings and stores them in an index
- Building a query generation system that transforms user messages into optimized search queries
- Implementing a chat interface to handle LLM interactions with users
- Designing a response generation system capable of handling various query types

First, let’s import the necessary libraries for this project. This includes `hnswlib` for the vector index and `unstructured` for loading and chunking the documents (more details on these later).


In [19]:
import uuid
import yaml
import hnswlib
from typing import List, Dict
from unstructured.partition.html import partition_html
from unstructured.chunking.title import chunk_by_title

## Define documents

Next, we’ll define the documents we’ll use for RAG. We’ll use a few pages from the Cohere documentation that discuss prompt engineering. Each entry is identified by its title and URL.

In [20]:
raw_documents = [
    {
        "title": "Crafting Effective Prompts",
        "url": "https://docs.cohere.com/docs/crafting-effective-prompts"},
    {
        "title": "Advanced Prompt Engineering Techniques",
        "url": "https://docs.cohere.com/docs/advanced-prompt-engineering-techniques"},
    {
        "title": "Prompt Truncation",
        "url": "https://docs.cohere.com/docs/prompt-truncation"},
    {
        "title": "Preambles",
        "url": "https://docs.cohere.com/docs/preambles"}
]

## Create vectorstore

The Vectorstore class handles the ingestion of documents into embeddings (or vectors) and the retrieval of relevant documents given a query.

It includes a few methods:

- `load_and_chunk`: Loads the raw documents from the URL and breaks them into smaller chunks
- `embed`: Generates embeddings of the chunked documents
- `index`: Indexes the document chunk embeddings to ensure efficient similarity search during retrieval
- `retrieve`: Uses semantic search to retrieve relevant document chunks from the index, given a query. It involves two steps: first, dense retrieval, which embeds the query via the Embed endpoint and fetches the nearest chunks from the index; second, reranking via the Rerank endpoint, which reorders those candidates by relevance and keeps only the top few.
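
To make the two-step idea concrete, here is a toy sketch that runs without any hosted model. It is not the tutorial's implementation: exact brute-force inner-product search stands in for the HNSW index, and the `rerank_score` callable is a hypothetical stand-in for the Rerank endpoint. The embeddings and relevance scores are made up.

```python
from typing import Callable, List, Sequence


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def two_stage_retrieve(
    query_emb: Sequence[float],
    doc_embs: List[Sequence[float]],
    rerank_score: Callable[[int], float],
    retrieve_top_k: int = 10,
    rerank_top_k: int = 3,
) -> List[int]:
    """Return document indices after dense retrieval followed by reranking."""
    # Stage 1: dense retrieval (exact search here; the tutorial uses an approximate index)
    k = min(retrieve_top_k, len(doc_embs))
    candidates = sorted(
        range(len(doc_embs)),
        key=lambda i: inner_product(query_emb, doc_embs[i]),
        reverse=True,
    )[:k]

    # Stage 2: rerank only the candidates that survived stage 1
    reranked = sorted(candidates, key=rerank_score, reverse=True)
    return reranked[:rerank_top_k]


doc_embs = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [0.6, 0.8]]
query_emb = [1.0, 0.0]
toy_relevance = {0: 0.2, 1: 0.9, 2: 1.0, 3: 0.5}  # made-up reranker scores

print(two_stage_retrieve(query_emb, doc_embs, toy_relevance.get, retrieve_top_k=3, rerank_top_k=2))
# [1, 3]
```

Document 2 has the highest toy relevance score but never reaches the reranker, because it falls outside the dense-retrieval candidates. This is why `retrieve_top_k` is usually set larger than `rerank_top_k`.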

In [21]:
class Vectorstore:

    def __init__(self, raw_documents: List[Dict[str, str]]):
        self.raw_documents = raw_documents
        self.docs = []
        self.docs_embs = []
        self.retrieve_top_k = 10
        self.rerank_top_k = 3
        self.load_and_chunk()
        self.embed()
        self.index()


    def load_and_chunk(self) -> None:
        """
        Loads the text from the sources and chunks the HTML content.
        """
        print("Loading documents...")

        for raw_document in self.raw_documents:
            elements = partition_html(url=raw_document["url"])
            chunks = chunk_by_title(elements)
            for chunk in chunks:
                self.docs.append(
                    {
                        "title": raw_document["title"],
                        "text": str(chunk),
                        "url": raw_document["url"],
                    }
                )

    def embed(self) -> None:
        """
        Embeds the document chunks using the Cohere API.
        """
        print("Embedding document chunks...")

        batch_size = 90
        self.docs_len = len(self.docs)
        for i in range(0, self.docs_len, batch_size):
            batch = self.docs[i : min(i + batch_size, self.docs_len)]
            texts = [item["text"] for item in batch]
            docs_embs_batch = co_embed.embed(
                texts=texts,
                input_type="search_document"
            ).embeddings
            self.docs_embs.extend(docs_embs_batch)

    def index(self) -> None:
        """
        Indexes the document chunks for efficient retrieval.
        """
        print("Indexing document chunks...")

        self.idx = hnswlib.Index(space="ip", dim=1024)
        self.idx.init_index(max_elements=self.docs_len, ef_construction=512, M=64)
        self.idx.add_items(self.docs_embs, list(range(len(self.docs_embs))))

        print(f"Indexing complete with {self.idx.get_current_count()} document chunks.")

    def retrieve(self, query: str) -> List[Dict[str, str]]:
        """
        Retrieves document chunks based on the given query.

        Parameters:
        query (str): The query to retrieve document chunks for.

        Returns:
        List[Dict[str, str]]: A list of dictionaries representing the retrieved document chunks, with 'title', 'text', and 'url' keys.
        """

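        # Dense retrieval: embed the query, then fetch the nearest chunks from the index
        query_emb = co_embed.embed(
            texts=[query],
            input_type="search_query"
        ).embeddings

        # Cap k at the index size; knn_query raises an error if it cannot return k results
        k = min(self.retrieve_top_k, self.docs_len)
        doc_ids = self.idx.knn_query(query_emb, k=k)[0][0]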

        # Reranking
        docs_to_rerank = [self.docs[doc_id] for doc_id in doc_ids]
        yaml_docs = [yaml.dump(doc, sort_keys=False) for doc in docs_to_rerank] 
        rerank_results = co_rerank.rerank(
            query=query,
            documents=yaml_docs,
            top_n=self.rerank_top_k
        )
        doc_ids_reranked = [doc_ids[result.index] for result in rerank_results.results]

        docs_retrieved = []
        for doc_id in doc_ids_reranked:
            docs_retrieved.append(
                {
                    "title": self.docs[doc_id]["title"],
                    "text": self.docs[doc_id]["text"],
                    "url": self.docs[doc_id]["url"],
                }
            )

        return docs_retrieved
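
One detail in `retrieve` is easy to miss: each rerank result's `index` is a position within the list submitted for reranking, not a chunk ID in the vector index, which is why the code looks it up through `doc_ids`. A small illustration with made-up values:

```python
# Chunk IDs returned by dense retrieval, best match first (made-up values)
doc_ids = [42, 7, 19, 3]

# Hypothetical reranker output: positions within the submitted list, best first
rerank_positions = [2, 0, 3]

# Map each position back to the chunk ID it refers to
doc_ids_reranked = [doc_ids[position] for position in rerank_positions]
print(doc_ids_reranked)  # [19, 42, 3]
```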

## Process documents

With the Vectorstore set up, we can process the documents, which will involve chunking, embedding, and indexing.

In [22]:
# Create an instance of the Vectorstore class with the given sources
vectorstore = Vectorstore(raw_documents)

Loading documents...
Embedding document chunks...
Indexing document chunks...
Indexing complete with 137 document chunks.
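
The `embed` method sends chunks to the Embed endpoint in batches of 90 rather than all at once. The batching logic on its own can be sketched as follows; the generator name `batched` is hypothetical, and the chunk strings are placeholders.

```python
from typing import Iterator, Sequence


def batched(items: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        # Slicing past the end is safe in Python, so no explicit min() is needed
        yield items[start : start + batch_size]


chunks = [f"chunk {i}" for i in range(137)]
print([len(batch) for batch in batched(chunks, 90)])  # [90, 47]
```

With the 137 chunks reported above, this means two embedding requests: one with 90 texts and one with 47.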


We can test if the retrieval is working by entering a search query.

In [23]:
vectorstore.retrieve("Prompting by giving examples")

[{'title': 'Advanced Prompt Engineering Techniques',
  'text': 'Few-shot Prompting\n\nUnlike the zero-shot examples above, few-shot prompting is a technique that provides a model with examples of the task being performed before asking the specific question to be answered. We can steer the LLM toward a high-quality solution by providing a few relevant and diverse examples in the prompt. Good examples condition the model to the expected response type and style.',
  'url': 'https://docs.cohere.com/docs/advanced-prompt-engineering-techniques'},
 {'title': 'Crafting Effective Prompts',
  'text': 'Incorporating Example Outputs\n\nLLMs respond well when they have specific examples to work from. For example, instead of asking for the salient points of the text and using bullet points “where appropriate”, give an example of what the output should look like.',
  'url': 'https://docs.cohere.com/docs/crafting-effective-prompts'},
 {'title': 'Advanced Prompt Engineering Techniques',
  'text': 'In a

## Run chatbot

We can now run the chatbot. For this, we create a `run_chatbot` function that accepts the user message and the history of the conversation, if any.

Here's what happens inside the function:
- For each user message, we use the Chat endpoint’s search query generation feature to turn the user message into one or more queries that are optimized for retrieval. This is done by calling the Chat endpoint with the `search_queries_only` parameter set to `True`. The endpoint can also return no query at all, meaning the user message can be answered directly without retrieval.
- If no search query is generated, we call the Chat endpoint to generate a response directly. If there is at least one, we call the `retrieve` method of the `Vectorstore` instance to retrieve the most relevant document chunks for each query.
- Finally, all the results from all queries are appended to a list and passed to the Chat endpoint for response generation.
- We print the response, together with the citations and the list of document chunks cited, for easy reference.
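
When several search queries are generated, their results can overlap, and the same chunk may then be passed to the Chat endpoint more than once. The function in this tutorial passes everything along as is. As an optional refinement, duplicates could be dropped before response generation. The sketch below assumes chunks are dictionaries with `url` and `text` keys, as returned by `retrieve`; the helper `collect_documents` and the fake index are hypothetical.

```python
from typing import Callable, Dict, List


def collect_documents(
    search_queries: List[str],
    retrieve: Callable[[str], List[Dict[str, str]]],
) -> List[Dict[str, str]]:
    """Retrieve chunks for every query, keeping only the first copy of each chunk."""
    seen = set()
    documents = []
    for query in search_queries:
        for doc in retrieve(query):
            key = (doc["url"], doc["text"])  # identifies a chunk within a page
            if key not in seen:
                seen.add(key)
                documents.append(doc)
    return documents


# A fake retriever standing in for vectorstore.retrieve
chunk_a = {"title": "A", "text": "Zero-shot text", "url": "https://example.invalid/a"}
chunk_b = {"title": "A", "text": "Few-shot text", "url": "https://example.invalid/a"}
chunk_c = {"title": "B", "text": "Other text", "url": "https://example.invalid/b"}
fake_index = {
    "zero-shot prompting": [chunk_a, chunk_b],
    "few-shot prompting": [chunk_b, chunk_c],
}

documents = collect_documents(list(fake_index), fake_index.__getitem__)
print([doc["text"] for doc in documents])
# ['Zero-shot text', 'Few-shot text', 'Other text']
```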


In [24]:
def run_chatbot(message, chat_history=None):
        
    if chat_history is None:
        chat_history = []
    
    # Generate search queries, if any        
    response = co_chat.chat(
        message=message,
        search_queries_only=True,
        chat_history=chat_history,
    )
    
    search_queries = []
    for query in response.search_queries:
        search_queries.append(query.text)

    # If there are search queries, retrieve the documents
    if search_queries:
        print("Retrieving information...", end="")

        # Retrieve document chunks for each query
        documents = []
        for query in search_queries:
            documents.extend(vectorstore.retrieve(query))

        # Use document chunks to respond
        response = co_chat.chat(
            message=message,
            documents=documents,
            chat_history=chat_history
        )

    else:
        response = co_chat.chat(
            message=message,
            chat_history=chat_history
        )
        
    # Print the chatbot response, citations, and documents
    print("\nRESPONSE:\n")
    print(response.text)
        
    if response.citations:
        print("\nCITATIONS:\n")           
        for citation in response.citations:
            print(citation)
        print("\nDOCUMENTS:\n")           
        for document in response.documents:
            print(document)
            
    chat_history = response.chat_history

    return chat_history
                

Here is a sample conversation consisting of a few turns.

In [25]:
chat_history = run_chatbot("Hello, I have a question")


RESPONSE:

Hello! I'm Command, an AI assistant chatbot trained to assist human users by providing thorough responses. How can I help you today?


In [26]:
chat_history = run_chatbot("What's the difference between zero-shot and few-shot prompting", chat_history)

Retrieving information...
RESPONSE:

Zero-shot prompting is a technique that does not provide a model with examples of the task being performed before asking the specific question to be answered. On the other hand, few-shot prompting provides a model with examples of the task being performed before asking the specific question to be answered.

CITATIONS:

start=40 end=158 text='does not provide a model with examples of the task being performed before asking the specific question to be answered.' document_ids=['doc_0', 'doc_3']
start=197 end=307 text='provides a model with examples of the task being performed before asking the specific question to be answered.' document_ids=['doc_0', 'doc_3']

DOCUMENTS:

{'id': 'doc_0', 'text': 'Few-shot Prompting\n\nUnlike the zero-shot examples above, few-shot prompting is a technique that provides a model with examples of the task being performed before asking the specific question to be answered. We can steer the LLM toward a high-quality solution 

In [27]:
chat_history = run_chatbot("What do you know about 5G networks?", chat_history)

Retrieving information...
RESPONSE:

I'm sorry, I couldn't find any information about 5G networks. Is there anything else I can help you with?


In [28]:
print("Chat history:")
for c in chat_history:
    print(c, "\n")
print("="*50)

Chat history:
role='USER' message='Hello, I have a question' tool_calls=None 

role='CHATBOT' message="Hello! I'm Command, an AI assistant chatbot trained to assist human users by providing thorough responses. How can I help you today?" tool_calls=None 

role='USER' message="What's the difference between zero-shot and few-shot prompting" tool_calls=None 

role='CHATBOT' message='Zero-shot prompting is a technique that does not provide a model with examples of the task being performed before asking the specific question to be answered. On the other hand, few-shot prompting provides a model with examples of the task being performed before asking the specific question to be answered.' tool_calls=None 

role='USER' message='What do you know about 5G networks?' tool_calls=None 

role='CHATBOT' message="I'm sorry, I couldn't find any information about 5G networks. Is there anything else I can help you with?" tool_calls=None 



There are a few observations worth pointing out:

- Direct response: For user messages that don’t require retrieval (“Hello, I have a question”), no search query is generated and the chatbot responds directly.
- Citation generation: For responses that do require retrieval ("What's the difference between zero-shot and few-shot prompting"), the endpoint returns the response together with the citations. These are fine-grained citations, which means they refer to specific spans of the generated text.
- State management: The endpoint maintains the state of the conversation via the `chat_history` parameter, which lets it correctly respond to a vague follow-up message, such as "How would the latter help?", by resolving it against earlier turns.
- Response synthesis: The model can decide if none of the retrieved documents provide the necessary information to answer a user message. For example, when asked the question, “What do you know about 5G networks”, the chatbot retrieves external information from the index. However, it doesn’t use any of the information in its response as none of it is relevant to the question.
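
Because the conversation state lives entirely in `chat_history`, it can be saved between sessions. The following is a minimal sketch, assuming each entry exposes `role` and `message` attributes as in the printed history above. `SimpleNamespace` objects stand in for the SDK's message objects, and the helper `history_to_json` is hypothetical.

```python
import json
from types import SimpleNamespace


def history_to_json(chat_history) -> str:
    """Serialize chat turns to JSON, keeping only the role and message fields."""
    turns = [{"role": turn.role, "message": turn.message} for turn in chat_history]
    return json.dumps(turns, indent=2)


# Stand-ins for the entries of response.chat_history shown above
history = [
    SimpleNamespace(role="USER", message="Hello, I have a question", tool_calls=None),
    SimpleNamespace(
        role="CHATBOT",
        message="Hello! I'm Command, an AI assistant chatbot trained to assist human users by providing thorough responses. How can I help you today?",
        tool_calls=None,
    ),
]

restored = json.loads(history_to_json(history))
print(restored[0])
# {'role': 'USER', 'message': 'Hello, I have a question'}
```

Whether the restored dictionaries can be passed straight back as `chat_history` depends on the SDK version in use, so treat that step as something to verify against your installation.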


## Conclusion

In this tutorial, we learned:
- How to set up Cohere clients for the Command, Embed, and Rerank models deployed on Azure AI Foundry
- How to build a RAG application by combining retrieval and chat capabilities
- How to manage chat history and maintain conversational context
- How to handle direct responses vs responses requiring retrieval
- How citations are automatically generated for retrieved information

In the next tutorial, we'll explore how to leverage Cohere's tool use features to build agentic applications.