<a target="_blank" href="https://colab.research.google.com/github/cohere-ai/notebooks/blob/main/notebooks/llmu/co_aws_ch6_rag_bedrock_sm.ipynb"> <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/> </a>

# Retrieval-Augmented Generation (RAG) Using Cohere on AWS

Large Language Models (LLMs) have proven effective at performing text generation tasks and maintaining the context of a conversation in a chat setting. However, an LLM can sometimes hallucinate and provide factually inaccurate responses to a given question. This is especially true in business settings, where companies have proprietary data that an LLM would not have seen during its training phase.

Retrieval-augmented generation (RAG) bridges this gap by allowing an LLM to integrate external data sources and use them in its response generation. This reduces hallucination, making the model's responses more accurate and reliable.

# Setup

In [28]:
! pip install cohere cohere-aws boto3 hnswlib unstructured -q

In [2]:
import os
import cohere
import boto3
import cohere_aws
from cohere_aws import Client

First, we set up the clients for Bedrock (to be used for Chat and Embed) and SageMaker (to be used for Rerank) using the same steps as in the previous chapters. Here we name the clients co_br for Bedrock and co_sm for SageMaker.

To use Bedrock, we create a BedrockClient by passing the necessary AWS credentials.

In [42]:
# Create Bedrock client via the native Cohere SDK
# Contact your AWS administrator for the credentials

co_br = cohere.BedrockClient(
    aws_region="YOUR_AWS_REGION",
    aws_access_key="YOUR_AWS_ACCESS_KEY_ID",
    aws_secret_key="YOUR_AWS_SECRET_ACCESS_KEY",
    aws_session_token="YOUR_AWS_SESSION_TOKEN",
)
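
Rather than pasting keys into a notebook cell, one option is to read them from environment variables. The following is a minimal sketch, not part of the original notebook: the helper name aws_client_kwargs is hypothetical, and the AWS_REGION variable name is an assumption (the other three variable names match the ones set later in this notebook). It only builds the keyword arguments shown above and does not contact AWS.

```python
import os


def aws_client_kwargs(env=os.environ):
    """Collect the client keyword arguments used above from environment variables.

    Hypothetical helper: the variable name AWS_REGION is an assumption.
    """
    required = {
        "aws_region": "AWS_REGION",
        "aws_access_key": "AWS_ACCESS_KEY_ID",
        "aws_secret_key": "AWS_SECRET_ACCESS_KEY",
    }
    # Fail early with a readable message if anything is missing or empty
    missing = [var for var in required.values() if not env.get(var)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")

    kwargs = {name: env[var] for name, var in required.items()}

    # A session token is only present when temporary credentials are used
    if env.get("AWS_SESSION_TOKEN"):
        kwargs["aws_session_token"] = env["AWS_SESSION_TOKEN"]
    return kwargs


# Usage (not executed here): co_br = cohere.BedrockClient(**aws_client_kwargs())
```

Keeping credentials out of the notebook also makes it safer to share or commit.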

Later we'll need to create a SageMaker endpoint that exposes access to a Cohere model (Rerank v3 in our case). For this, we’ll use the cohere_aws SDK which makes it easy to set up the endpoint, together with AWS’s boto3 library.

Once the endpoint is created (as we’ll walk through later), we can access it using the cohere SDK. To do this, let’s create a SagemakerClient by passing the necessary AWS credentials.

In [None]:
# Create SageMaker client via the native Cohere SDK
# Contact your AWS administrator for the credentials

co_sm = cohere.SagemakerClient(
    aws_region="YOUR_AWS_REGION",
    aws_access_key="YOUR_AWS_ACCESS_KEY_ID",
    aws_secret_key="YOUR_AWS_SECRET_ACCESS_KEY",
    aws_session_token="YOUR_AWS_SESSION_TOKEN",
)

# For creating an endpoint, you need to use the cohere_aws client: Set environment variables with the AWS credentials
os.environ['AWS_ACCESS_KEY_ID'] = "YOUR_AWS_ACCESS_KEY_ID"
os.environ['AWS_SECRET_ACCESS_KEY'] = "YOUR_AWS_SECRET_ACCESS_KEY"
os.environ['AWS_SESSION_TOKEN'] = "YOUR_AWS_SESSION_TOKEN"

# Create SageMaker Endpoint

The next step is to create a Rerank SageMaker endpoint by defining the model package Amazon Resource Name (ARN) for the Rerank model. The ARN is an identifying string for a SageMaker resource, and it varies between the regions where a resource is available.

Here, we define the Cohere package for the Rerank model and map the model package against each region, which gives the complete ARN for each region.

In [43]:
# Create SageMaker endpoint via the cohere_aws SDK

cohere_package = "cohere-rerank-english-v3-01-d3687e0d2e3a366bb904275616424807"
model_package_map = {
    "us-east-1": f"arn:aws:sagemaker:us-east-1:865070037744:model-package/{cohere_package}",
    "us-east-2": f"arn:aws:sagemaker:us-east-2:057799348421:model-package/{cohere_package}",
    "us-west-1": f"arn:aws:sagemaker:us-west-1:382657785993:model-package/{cohere_package}",
    "us-west-2": f"arn:aws:sagemaker:us-west-2:594846645681:model-package/{cohere_package}",
    "ca-central-1": f"arn:aws:sagemaker:ca-central-1:470592106596:model-package/{cohere_package}",
    "eu-central-1": f"arn:aws:sagemaker:eu-central-1:446921602837:model-package/{cohere_package}",
    "eu-west-1": f"arn:aws:sagemaker:eu-west-1:985815980388:model-package/{cohere_package}",
    "eu-west-2": f"arn:aws:sagemaker:eu-west-2:856760150666:model-package/{cohere_package}",
    "eu-west-3": f"arn:aws:sagemaker:eu-west-3:843114510376:model-package/{cohere_package}",
    "eu-north-1": f"arn:aws:sagemaker:eu-north-1:136758871317:model-package/{cohere_package}",
    "ap-southeast-1": f"arn:aws:sagemaker:ap-southeast-1:192199979996:model-package/{cohere_package}",
    "ap-southeast-2": f"arn:aws:sagemaker:ap-southeast-2:666831318237:model-package/{cohere_package}",
    "ap-northeast-2": f"arn:aws:sagemaker:ap-northeast-2:745090734665:model-package/{cohere_package}",
    "ap-northeast-1": f"arn:aws:sagemaker:ap-northeast-1:977537786026:model-package/{cohere_package}",
    "ap-south-1": f"arn:aws:sagemaker:ap-south-1:077584701553:model-package/{cohere_package}",
    "sa-east-1": f"arn:aws:sagemaker:sa-east-1:270155090741:model-package/{cohere_package}",
}

region = boto3.Session().region_name
if region not in model_package_map.keys():
    raise Exception("UNSUPPORTED REGION")

model_package_arn = model_package_map[region]

co_aws = Client(region_name=region)

co_aws.create_endpoint(arn=model_package_arn, endpoint_name="my-rerank-v3", instance_type="ml.g5.xlarge", n_instances=1)
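
The region lookup above can fail in two ways: the session may have no default region configured (in which case region_name is None), or the region may not be in the map. As a hedged sketch, the hypothetical helper below (resolve_model_package_arn is not part of any SDK) reports both cases with a more descriptive error. It uses a two-entry excerpt of the map above purely for illustration.

```python
def resolve_model_package_arn(region, package_map):
    """Return the model package ARN for `region`, or raise a descriptive error.

    Hypothetical helper for illustration; not part of cohere_aws or boto3.
    """
    if region is None:
        raise ValueError("No default AWS region is configured for this session.")
    try:
        return package_map[region]
    except KeyError:
        supported = ", ".join(sorted(package_map))
        raise ValueError(
            f"Region {region!r} is not supported. Supported regions: {supported}"
        ) from None


# Two-entry excerpt of the map defined above, for illustration only
example_package = "cohere-rerank-english-v3-01-d3687e0d2e3a366bb904275616424807"
example_map = {
    "us-east-1": f"arn:aws:sagemaker:us-east-1:865070037744:model-package/{example_package}",
    "us-west-2": f"arn:aws:sagemaker:us-west-2:594846645681:model-package/{example_package}",
}

print(resolve_model_package_arn("us-east-1", example_map))
```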

# Quick Example

We’ll start with a quick example to understand the key aspects of RAG.

With RAG, the first step is to define the documents that an LLM will have access to. Here, we have a short list of simple documents. Typically, there is a retrieval process to retrieve the most relevant documents based on a user query, which we’ll cover in the longer walkthrough next. But at this point, let’s assume that these are the only documents and we’ll pass all of them to the LLM.

In [9]:
documents = [
    {
        "title": "Tall penguins",
        "text": "Emperor penguins are the tallest."},
    {
        "title": "Penguin habitats",
        "text": "Emperor penguins only live in Antarctica."},
    {
        "title": "What are animals?",
        "text": "Animals are different from plants."}
]

We have seen how to use the Chat endpoint in the text generation chapter. To use the RAG feature, we simply need to add one additional parameter, documents, to the endpoint call. These are the documents we defined earlier, which are now available to the model to draw on in its response.

Let’s now see how the model responds when given the user message, "What are the tallest living penguins?"

In [14]:
message = "What are the tallest living penguins?"

response = co_br.chat(message=message,
                   documents=documents,
                   model="cohere.command-r-plus-v1:0")

print("\nRESPONSE:\n")
print(response.text)
    
if response.citations:
    print("\nCITATIONS:\n")           
    for citation in response.citations:
        print(citation)


RESPONSE:

The tallest living penguins are the Emperor penguins. These penguins only live in Antarctica.

CITATIONS:

start=4 end=53 text='tallest living penguins are the Emperor penguins.' document_ids=['doc_0']
start=69 end=93 text='only live in Antarctica.' document_ids=['doc_1']


In the response, the model used the documents to inform its answer to the question. For example, the "tallest living penguins are the Emperor penguins." part of its response was cited from doc_0, the first document in the list, which contains the text "Emperor penguins are the tallest."
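
To make the citation offsets concrete, here is a small sketch that applies the start and end values from the output above to the response text. It assumes the offsets are character positions into the response text, which is consistent with the printed output. Plain dictionaries stand in for the citation objects returned by the SDK, and the annotate helper is hypothetical; it also assumes citations do not overlap.

```python
# Response text and citation offsets copied from the output above
response_text = (
    "The tallest living penguins are the Emperor penguins. "
    "These penguins only live in Antarctica."
)
citations = [
    {"start": 4, "end": 53, "document_ids": ["doc_0"]},
    {"start": 69, "end": 93, "document_ids": ["doc_1"]},
]


def annotate(text, citations):
    """Insert a [doc_id] marker after each cited span (assumes no overlaps)."""
    pieces = []
    cursor = 0
    for citation in sorted(citations, key=lambda c: c["start"]):
        pieces.append(text[cursor:citation["end"]])
        pieces.append(" [" + ", ".join(citation["document_ids"]) + "]")
        cursor = citation["end"]
    pieces.append(text[cursor:])
    return "".join(pieces)


# Each (start, end) pair slices the cited span out of the response text
for citation in citations:
    span = response_text[citation["start"]:citation["end"]]
    print(repr(span), "->", citation["document_ids"])

print(annotate(response_text, citations))
```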

# A More Comprehensive Example

Now that we’ve covered the basics, let’s look at a more comprehensive example of RAG that includes:

- Building a retrieval system that includes turning documents into text embeddings and storing them in an index
- Building a query generation system that turns user messages into optimized queries for retrieval
- Wrapping a user interaction with an LLM in a chat interface
- Building a response generation system that’s able to answer different types of queries, such as those that require and don’t require RAG


First, let’s import the necessary libraries for this project. This includes hnswlib for the vector library and unstructured for chunking the documents (more details on these later).

In [15]:
import uuid
import hnswlib
from typing import List, Dict
from unstructured.partition.html import partition_html
from unstructured.chunking.title import chunk_by_title

Next, we’ll define the documents we’ll use for RAG. We’ll use a few pages from the Cohere documentation that discuss prompt engineering, listed in the Python list raw_documents below. Each entry is identified by its title and URL.

# Define Documents

In [16]:
raw_documents = [
    {
        "title": "Crafting Effective Prompts",
        "url": "https://docs.cohere.com/docs/crafting-effective-prompts"},
    {
        "title": "Advanced Prompt Engineering Techniques",
        "url": "https://docs.cohere.com/docs/advanced-prompt-engineering-techniques"},
    {
        "title": "Prompt Truncation",
        "url": "https://docs.cohere.com/docs/prompt-truncation"},
    {
        "title": "Preambles",
        "url": "https://docs.cohere.com/docs/preambles"}
]

# Create Vectorstore

The Vectorstore class handles the ingestion of documents into embeddings (or vectors) and the retrieval of relevant documents given a query.

It includes a few methods:

- load_and_chunk: Loads the raw documents from their URLs and breaks them into smaller chunks. We use the partition_html function from the unstructured library to parse each page and chunk_by_title to perform the chunking.
- embed: Generates embeddings of the chunked documents. We use the Embed endpoint available on Bedrock, which uses the cohere.embed-english-v3 model.
- index: Indexes the document chunk embeddings to ensure efficient similarity search during retrieval. For this, we’ll use the hnswlib vector library.
- retrieve: Uses semantic search to retrieve relevant document chunks from the index, given a query. It involves two steps: first, dense retrieval from the index via the Embed endpoint, and second, a reranking via the Rerank endpoint to boost the search results further.
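
The two-step retrieval described in the last item can be illustrated without any external services. The sketch below is a toy under stated assumptions: a brute-force inner-product search stands in for the hnswlib index, a word-overlap score stands in for the Rerank endpoint, and the two-dimensional embeddings are made up. What it shows is the shape of the pipeline, in particular how the positions returned by the reranker are mapped back to document ids.

```python
def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def dense_retrieve(query_emb, doc_embs, k):
    """Brute-force stand-in for the index: ids of the k highest-scoring documents."""
    ranked = sorted(
        range(len(doc_embs)),
        key=lambda i: inner_product(query_emb, doc_embs[i]),
        reverse=True,
    )
    return ranked[:k]


def toy_rerank(query, candidates, top_n):
    """Toy stand-in for Rerank: rank candidates by words shared with the query.

    Returns positions within `candidates`, not document ids.
    """
    query_words = set(query.lower().split())
    scores = [len(query_words & set(text.lower().split())) for text in candidates]
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return order[:top_n]


# Made-up documents and embeddings, for illustration only
docs = [
    "few-shot prompting uses examples",
    "preambles set the tone",
    "prompting with examples helps",
    "truncation shortens prompts",
]
doc_embs = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.3], [0.2, 0.7]]
query = "prompting with examples"
query_emb = [1.0, 0.0]

# Step 1: dense retrieval narrows the corpus to a few candidates
doc_ids = dense_retrieve(query_emb, doc_embs, k=3)

# Step 2: reranking reorders the candidates; map positions back to document ids
positions = toy_rerank(query, [docs[i] for i in doc_ids], top_n=2)
doc_ids_reranked = [doc_ids[p] for p in positions]

print(doc_ids, doc_ids_reranked)
```

In this toy data the reranker promotes the third document above the first, which is the kind of reordering the second step is meant to provide.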

In [17]:
class Vectorstore:

    def __init__(self, raw_documents: List[Dict[str, str]]):
        self.raw_documents = raw_documents
        self.docs = []
        self.docs_embs = []
        self.retrieve_top_k = 10
        self.rerank_top_k = 3
        self.load_and_chunk()
        self.embed()
        self.index()


    def load_and_chunk(self) -> None:
        """
        Loads the text from the sources and chunks the HTML content.
        """
        print("Loading documents...")

        for raw_document in self.raw_documents:
            elements = partition_html(url=raw_document["url"])
            chunks = chunk_by_title(elements)
            for chunk in chunks:
                self.docs.append(
                    {
                        "title": raw_document["title"],
                        "text": str(chunk),
                        "url": raw_document["url"],
                    }
                )

    def embed(self) -> None:
        """
        Embeds the document chunks using the Cohere API.
        """
        print("Embedding document chunks...")

        batch_size = 90
        self.docs_len = len(self.docs)
        for i in range(0, self.docs_len, batch_size):
            batch = self.docs[i : min(i + batch_size, self.docs_len)]
            texts = [item["text"] for item in batch]
            docs_embs_batch = co_br.embed(
                                texts=texts,
                                model="cohere.embed-english-v3",
                                input_type="search_document"
            ).embeddings
            self.docs_embs.extend(docs_embs_batch)

    def index(self) -> None:
        """
        Indexes the document chunks for efficient retrieval.
        """
        print("Indexing document chunks...")

        self.idx = hnswlib.Index(space="ip", dim=1024)
        self.idx.init_index(max_elements=self.docs_len, ef_construction=512, M=64)
        self.idx.add_items(self.docs_embs, list(range(len(self.docs_embs))))

        print(f"Indexing complete with {self.idx.get_current_count()} document chunks.")

    def retrieve(self, query: str) -> List[Dict[str, str]]:
        """
        Retrieves document chunks based on the given query.

        Parameters:
        query (str): The query to retrieve document chunks for.

        Returns:
        List[Dict[str, str]]: A list of dictionaries representing the retrieved document chunks, with 'title', 'text', and 'url' keys.
        """

        # Dense retrieval
        query_emb = co_br.embed(
                        texts=[query],
                        model="cohere.embed-english-v3",
                        input_type="search_query"
        ).embeddings
        
        doc_ids = self.idx.knn_query(query_emb, k=self.retrieve_top_k)[0][0]

        # Reranking
        rank_fields = ["title", "text"] # We'll use the title and text fields for reranking

        docs_to_rerank = [self.docs[doc_id] for doc_id in doc_ids]
        rerank_results = co_sm.rerank(
                            query=query,
                            documents=docs_to_rerank,
                            top_n=self.rerank_top_k,
                            rank_fields=rank_fields,
                            model="my-rerank-v3")

        doc_ids_reranked = [doc_ids[result.index] for result in rerank_results.results]

        docs_retrieved = []
        for doc_id in doc_ids_reranked:
            docs_retrieved.append(
                {
                    "title": self.docs[doc_id]["title"],
                    "text": self.docs[doc_id]["text"],
                    "url": self.docs[doc_id]["url"],
                }
            )

        return docs_retrieved
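
The embed method sends the chunks to the endpoint in batches of 90 rather than all at once, presumably to keep each request within the endpoint's limit on the number of texts per call. The batching logic on its own can be sketched as follows; the batched helper is hypothetical and the chunk list is made up. Because Python slices clamp at the end of a list, the explicit min used in the class is not strictly needed.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements each."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Made-up chunks, for illustration only
chunks = [f"chunk {i}" for i in range(200)]
batch_sizes = [len(batch) for batch in batched(chunks, 90)]
print(batch_sizes)
```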

# Process Documents

Now that the Vectorstore component is set up, we can process the documents, which will involve chunking, embedding, and indexing. We do this by creating an instance of the Vectorstore and passing the raw documents we defined earlier.

In [18]:
# Create an instance of the Vectorstore class with the given sources
vectorstore = Vectorstore(raw_documents)

Loading documents...
Embedding document chunks...
Indexing document chunks...
Indexing complete with 44 document chunks.


We can test if the retrieval is working by entering a search query.

In [19]:
vectorstore.retrieve("Prompting by giving examples")

[{'title': 'Advanced Prompt Engineering Techniques',
  'text': 'Few-shot Prompting\n\nUnlike the zero-shot examples above, few-shot prompting is a technique that provides a model with examples of the task being performed before asking the specific question to be answered. We can steer the LLM toward a high-quality solution by providing a few relevant and diverse examples in the prompt. Good examples condition the model to the expected response type and style.',
  'url': 'https://docs.cohere.com/docs/advanced-prompt-engineering-techniques'},
 {'title': 'Crafting Effective Prompts',
  'text': 'Incorporating Example Outputs\n\nLLMs respond well when they have specific examples to work from. For example, instead of asking for the salient points of the text and using bullet points “where appropriate”, give an example of what the output should look like.',
  'url': 'https://docs.cohere.com/docs/crafting-effective-prompts'},
 {'title': 'Advanced Prompt Engineering Techniques',
  'text': 'In a

# Run Chatbot

We can now run the chatbot. For this, we create a run_chatbot function which includes the RAG components:
- For each user message, we use the endpoint’s search query generation feature to turn the message into one or more queries that are optimized for retrieval. The endpoint can even return no query, which means that a user message can be responded to directly without retrieval. This is done by calling the Chat endpoint with the search_queries_only parameter and setting it as True.
- If there is no search query generated, we call the Chat endpoint to generate a response directly. If there is at least one, we call the retrieve method from the Vectorstore instance to retrieve the most relevant documents to each query.
- Finally, all the results from all queries are appended to a list and passed to the Chat endpoint for response generation.
- We print the response, together with the citations and the list of document chunks cited, for easy reference.

In [26]:
def run_chatbot(message, chat_history=None):
    
    if chat_history is None:
        chat_history = []
    
    # Generate search queries, if any        
    response = co_br.chat(message=message,
                            search_queries_only=True,
                            model="cohere.command-r-plus-v1:0",
                            chat_history=chat_history)
    
    search_queries = []
    for query in response.search_queries:
        search_queries.append(query.text)

    # If there are search queries, retrieve the documents
    if search_queries:
        print("Retrieving information...", end="")

        # Retrieve document chunks for each query
        documents = []
        for query in search_queries:
            documents.extend(vectorstore.retrieve(query))

        # Use document chunks to respond
        response = co_br.chat(
            message=message,
            model="cohere.command-r-plus-v1:0",
            documents=documents,
            chat_history=chat_history)

    else:
        response = co_br.chat(
            message=message,
            model="cohere.command-r-plus-v1:0",
            chat_history=chat_history)
        
    # Print the chatbot response, citations, and documents
    
    print("\nRESPONSE:\n")
    print(response.text)
        
    if response.citations:
        print("\nCITATIONS:\n")           
        for citation in response.citations:
            print(citation)
        print("\nDOCUMENTS:\n")           
        for document in response.documents:
            print(document)
            
    chat_history = response.chat_history

    return chat_history
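
When several search queries are generated, the same chunk can be retrieved more than once, because the results of each query are simply appended to the list. One possible refinement, sketched here as an assumption rather than something the notebook does, is to drop duplicates before passing the documents to the Chat endpoint. The dedupe_documents helper is hypothetical and treats two chunks as identical when their url and text match.

```python
def dedupe_documents(documents):
    """Remove repeated chunks while preserving the original order."""
    seen = set()
    unique = []
    for doc in documents:
        key = (doc["url"], doc["text"])
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


# Made-up retrieval results from two queries that share one chunk
retrieved = [
    {"title": "A", "text": "few-shot prompting", "url": "https://example.com/a"},
    {"title": "B", "text": "preambles", "url": "https://example.com/b"},
    {"title": "A", "text": "few-shot prompting", "url": "https://example.com/a"},
]
unique_docs = dedupe_documents(retrieved)
print([doc["title"] for doc in unique_docs])
```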
                

Here is a sample conversation consisting of a few turns. 

In [27]:
chat_history = run_chatbot("Hello, I have a question")


RESPONSE:

Of course! I am here to help. Please go ahead and ask your question, and I will do my best to provide a helpful response.


In [28]:
chat_history = run_chatbot("What's the difference between zero-shot and few-shot prompting", chat_history)

Retrieving information...
RESPONSE:

Zero-shot prompting is when no examples of the task are provided to the model. On the other hand, few-shot prompting is a technique where a model is given a few examples of the task being performed before asking the specific question to be answered.

CITATIONS:

start=0 end=19 text='Zero-shot prompting' document_ids=['doc_0']
start=28 end=78 text='no examples of the task are provided to the model.' document_ids=['doc_0']
start=98 end=116 text='few-shot prompting' document_ids=['doc_0']
start=140 end=197 text='model is given a few examples of the task being performed' document_ids=['doc_0']
start=205 end=249 text='asking the specific question to be answered.' document_ids=['doc_0']

DOCUMENTS:

{'id': 'doc_0', 'text': 'Few-shot Prompting\n\nUnlike the zero-shot examples above, few-shot prompting is a technique that provides a model with examples of the task being performed before asking the specific question to be answered. We can steer the LLM towar

In [29]:
chat_history = run_chatbot("How would the latter help?", chat_history)

Retrieving information...
RESPONSE:

Few-shot prompting can vastly improve the quality of the model's completions. By providing a few relevant and diverse examples, the model can be steered toward a high-quality solution. These examples condition the model to the expected response type and style.

CITATIONS:

start=23 end=77 text="vastly improve the quality of the model's completions." document_ids=['doc_2']
start=97 end=126 text='relevant and diverse examples' document_ids=['doc_0']
start=145 end=184 text='steered toward a high-quality solution.' document_ids=['doc_0']
start=200 end=260 text='condition the model to the expected response type and style.' document_ids=['doc_0']

DOCUMENTS:

{'id': 'doc_2', 'text': 'Advanced Prompt Engineering Techniques\n\nSuggest Edits\n\nThe previous chapter discussed general rules and heuristics to follow for successfully prompting the Command family of models. Here, we will discuss specific advanced prompt engineering techniques that can in many cas

In [30]:
chat_history = run_chatbot("What do you know about 5G networks?", chat_history)

Retrieving information...
RESPONSE:

Sorry, I don't have any information about 5G networks. Is there anything else you would like to ask?


There are a few observations worth pointing out:

- Direct response: For user messages that don’t require retrieval (“Hello, I have a question”), the chatbot responds directly, skipping the retrieval step.
- Citation generation: For responses that do require retrieval ("What's the difference between zero-shot and few-shot prompting"), the endpoint returns the response together with the citations. These are fine-grained citations, which means they refer to specific spans of the generated text.
- State management: The endpoint maintains the state of the conversation via the chat_history parameter, for example, by correctly responding to a vague user message such as "How would the latter help?"
- Response synthesis: The model can decide if none of the retrieved documents provide the necessary information to answer a user message. For example, when asked the question, “What do you know about 5G networks”, the chatbot retrieves external information from the index. However, it doesn’t use any of the information in its response as none of it is relevant to the question.
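
To illustrate the state management point, the sketch below builds a conversation history by hand as a list of turns. The role and message fields mirror the ones that appear when the chat history is printed in this notebook. Whether the client accepts plain dictionaries in place of its own message objects is an assumption to check against the SDK documentation, and the append_turn helper is hypothetical.

```python
def append_turn(chat_history, user_message, chatbot_message):
    """Return a new history with one user turn and one chatbot turn added."""
    return chat_history + [
        {"role": "USER", "message": user_message},
        {"role": "CHATBOT", "message": chatbot_message},
    ]


history = []
history = append_turn(history, "Hello, I have a question", "Of course! I am here to help.")
history = append_turn(history, "What is few-shot prompting?", "It provides examples in the prompt.")

for turn in history:
    print(turn["role"], "-", turn["message"])
```

Because each call returns a new list, earlier versions of the history are left untouched, which can make it easier to replay or branch a conversation.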

Here are the contents of the chat history.

In [31]:
print("Chat history:")
for c in chat_history:
    print(c, "\n")
print("="*50)

Chat history:
message='Hello, I have a question' tool_calls=None role='USER' 

message='Of course! I am here to help. Please go ahead and ask your question, and I will do my best to provide a helpful response.' tool_calls=None role='CHATBOT' 

message="What's the difference between zero-shot and few-shot prompting" tool_calls=None role='USER' 

message='Zero-shot prompting is when no examples of the task are provided to the model. On the other hand, few-shot prompting is a technique where a model is given a few examples of the task being performed before asking the specific question to be answered.' tool_calls=None role='CHATBOT' 

message='How would the latter help?' tool_calls=None role='USER' 

message="Few-shot prompting can vastly improve the quality of the model's completions. By providing a few relevant and diverse examples, the model can be steered toward a high-quality solution. These examples condition the model to the expected response type and style." tool_calls=None role='

This notebook demonstrated how to create a RAG application using Cohere Chat and Embed on Amazon Bedrock and Cohere Rerank on Amazon SageMaker. RAG enhances LLMs by enabling them to integrate external data sources and reduce hallucination, resulting in more accurate and reliable responses.
