<a target="_blank" href="https://colab.research.google.com/github/cohere-ai/notebooks/blob/main/notebooks/llmu/RAG_with_Chat_Embed_and_Rerank.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# RAG with Chat, Embed, and Rerank

This notebook shows how to build a RAG-powered chatbot with Cohere's Chat endpoint.  The chatbot can extract relevant information from external documents and produce verifiable, inline citations in its responses.

Read the accompanying [article here](https://txt.cohere.com/rag-chatbot/).

This application will use several Cohere API endpoints:

- Chat: For handling the main logic of the chatbot, including turning a user message into queries, generating responses, and producing citations
- Embed: For turning textual documents into their embeddings representation, later to be used in retrieval (we’ll use the Embed v3 model, `embed-english-v3.0`)
- Rerank: For reranking the retrieved documents according to their relevance to a query

The diagram below provides an overview of what we’ll build.

![Workflow](../images/llmu/rag/rag-workflow-2.png)

Here is a summary of the steps involved.

Initial phase:
- **Step 0**: Ingest the documents – get documents, chunk, embed, and index.

For each user-chatbot interaction:
- **Step 1**: Get the user message
- **Step 2**: Call the Chat endpoint in query-generation mode
- If at least one query is generated
  - **Step 3**: Retrieve and rerank relevant documents
  - **Step 4**: Call the Chat endpoint in document mode to generate a grounded response with citations
- If no query is generated
  - **Step 4**: Call the Chat endpoint in normal mode to generate a response
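The branching in Steps 2 to 4 can be sketched in a few lines. This is a minimal illustration only: `route_message` and the three callables passed to it are hypothetical names introduced here, and the stand-in functions do not call any Cohere endpoint.

```python
from typing import Callable, Dict, List, Optional

Document = Dict[str, str]


def route_message(
    message: str,
    generate_queries: Callable[[str], List[str]],
    retrieve: Callable[[str], List[Document]],
    respond: Callable[[str, Optional[List[Document]]], str],
) -> str:
    """Route one user message through Steps 2-4 (hypothetical helper)."""
    # Step 2: turn the user message into zero or more search queries
    queries = generate_queries(message)

    # No query generated: Step 4 in normal mode (no documents)
    if not queries:
        return respond(message, None)

    # Step 3: collect the retrieved chunks for every query
    documents: List[Document] = []
    for query in queries:
        documents.extend(retrieve(query))

    # Step 4 in document mode: respond using the retrieved chunks
    return respond(message, documents)


# Stand-in callables for illustration only; none of them calls an API.
def fake_generate_queries(message: str) -> List[str]:
    return [message] if message.endswith("?") else []


def fake_retrieve(query: str) -> List[Document]:
    return [{"title": "Example", "text": f"Text about {query}", "url": "https://example.com"}]


def fake_respond(message: str, documents: Optional[List[Document]]) -> str:
    return "direct" if documents is None else f"grounded on {len(documents)} chunk(s)"


print(route_message("Hello", fake_generate_queries, fake_retrieve, fake_respond))
print(route_message("What is RAG?", fake_generate_queries, fake_retrieve, fake_respond))
```

The first call takes the no-query branch and the second takes the retrieval branch.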

# Setup

In [None]:
! pip install cohere hnswlib unstructured --upgrade nltk -q

In [1]:
import cohere
import uuid
import hnswlib
from typing import List, Dict
from unstructured.partition.html import partition_html
from unstructured.chunking.title import chunk_by_title

co = cohere.Client("COHERE_API_KEY") # Get your API key here: https://dashboard.cohere.com/api-keys

In [3]:
#@title Enable text wrapping in Google Colab

from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

# Create a vector store for ingestion and retrieval


![RAG components - Vectorstore](../images/llmu/rag/rag-components-vectorstore.png)


First, we define the list of documents we want to ingest and make available for retrieval. As an example, we'll use four pages from Cohere's documentation that cover prompt engineering: crafting effective prompts, advanced prompt engineering techniques, prompt truncation, and preambles.

In [3]:
raw_documents = [
    {
        "title": "Crafting Effective Prompts",
        "url": "https://docs.cohere.com/docs/crafting-effective-prompts"},
    {
        "title": "Advanced Prompt Engineering Techniques",
        "url": "https://docs.cohere.com/docs/advanced-prompt-engineering-techniques"},
    {
        "title": "Prompt Truncation",
        "url": "https://docs.cohere.com/docs/prompt-truncation"},
    {
        "title": "Preambles",
        "url": "https://docs.cohere.com/docs/preambles"}
]

Practical applications usually involve a large number of documents, so we need to be able to search them efficiently.  This involves breaking the documents into chunks, generating embeddings, and indexing the embeddings, as shown in the image above.  

We implement this in the `Vectorstore` class below, which takes the `raw_documents` list as input.  Three methods are immediately called when creating an object of the `Vectorstore` class:


`load_and_chunk()`  
This method uses the `partition_html()` function from the `unstructured` library to load each document from its URL, and the `chunk_by_title()` function to break it into smaller chunks.  Each chunk is turned into a dictionary object with three fields:
- `title` - the web page’s title,
- `text` - the textual content of the chunk, and
- `url` - the web page’s URL.  
  
  
`embed()`  
This method uses Cohere's `embed-english-v3.0` model to generate embeddings of the chunked documents.  Since our documents will be used for retrieval, we set `input_type="search_document"`.  We send the documents to the Embed endpoint in batches, because the endpoint has a limit of 96 documents per call.
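The batching step on its own can be sketched as follows. This is a minimal illustration: `batched` is a hypothetical helper and the chunk strings are placeholders. The class below uses a batch size of 90, which stays under the 96-document limit.

```python
from typing import Iterator, List


def batched(items: List[str], batch_size: int = 96) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


# 200 placeholder chunks are split into two full batches and one partial batch
chunks = [f"chunk {i}" for i in range(200)]
sizes = [len(batch) for batch in batched(chunks)]
print(sizes)  # [96, 96, 8]
```

Slicing past the end of a list is safe in Python, so the last batch simply comes out shorter.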

`index()`  
This method uses the `hnswlib` package to index the document chunk embeddings.  This will ensure efficient similarity search during retrieval.  Note that `hnswlib` is a vector library, and we have chosen it for its simplicity.

In [5]:
class Vectorstore:
    """
    A class representing a collection of documents indexed into a vectorstore.

    Parameters:
    raw_documents (list): A list of dictionaries representing the sources of the raw documents. Each dictionary should have 'title' and 'url' keys.

    Attributes:
    raw_documents (list): A list of dictionaries representing the raw documents.
    docs (list): A list of dictionaries representing the chunked documents, with 'title', 'text', and 'url' keys.
    docs_embs (list): A list of the associated embeddings for the document chunks.
    docs_len (int): The number of document chunks in the collection.
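    idx (hnswlib.Index): The index used for document retrieval.
    retrieve_top_k (int): The number of document chunks returned by dense retrieval.
    rerank_top_k (int): The number of document chunks kept after reranking.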

    Methods:
    load_and_chunk(): Loads the data from the sources and partitions the HTML content into chunks.
    embed(): Embeds the document chunks using the Cohere API.
    index(): Indexes the document chunks for efficient retrieval.
    retrieve(): Retrieves document chunks based on the given query.
    """

    def __init__(self, raw_documents: List[Dict[str, str]]):
        self.raw_documents = raw_documents
        self.docs = []
        self.docs_embs = []
        self.retrieve_top_k = 10
        self.rerank_top_k = 3
        self.load_and_chunk()
        self.embed()
        self.index()


    def load_and_chunk(self) -> None:
        """
        Loads the text from the sources and chunks the HTML content.
        """
        print("Loading documents...")

        for raw_document in self.raw_documents:
            elements = partition_html(url=raw_document["url"])
            chunks = chunk_by_title(elements)
            for chunk in chunks:
                self.docs.append(
                    {
                        "title": raw_document["title"],
                        "text": str(chunk),
                        "url": raw_document["url"],
                    }
                )

    def embed(self) -> None:
        """
        Embeds the document chunks using the Cohere API.
        """
        print("Embedding document chunks...")

        batch_size = 90
        self.docs_len = len(self.docs)
        for i in range(0, self.docs_len, batch_size):
            batch = self.docs[i : min(i + batch_size, self.docs_len)]
            texts = [item["text"] for item in batch]
            docs_embs_batch = co.embed(
                texts=texts, model="embed-english-v3.0", input_type="search_document"
            ).embeddings
            self.docs_embs.extend(docs_embs_batch)

    def index(self) -> None:
        """
        Indexes the document chunks for efficient retrieval.
        """
        print("Indexing document chunks...")

        self.idx = hnswlib.Index(space="ip", dim=1024)
        self.idx.init_index(max_elements=self.docs_len, ef_construction=512, M=64)
        self.idx.add_items(self.docs_embs, list(range(len(self.docs_embs))))

        print(f"Indexing complete with {self.idx.get_current_count()} document chunks.")

    def retrieve(self, query: str) -> List[Dict[str, str]]:
        """
        Retrieves document chunks based on the given query.

        Parameters:
        query (str): The query to retrieve document chunks for.

        Returns:
        List[Dict[str, str]]: A list of dictionaries representing the retrieved document chunks, with 'title', 'text', and 'url' keys.
        """

        # Dense retrieval
        query_emb = co.embed(
            texts=[query], model="embed-english-v3.0", input_type="search_query"
        ).embeddings
        
        doc_ids = self.idx.knn_query(query_emb, k=self.retrieve_top_k)[0][0]

        # Reranking
        rank_fields = ["title", "text"] # We'll use the title and text fields for reranking

        docs_to_rerank = [self.docs[doc_id] for doc_id in doc_ids]
        rerank_results = co.rerank(
            query=query,
            documents=docs_to_rerank,
            top_n=self.rerank_top_k,
            model="rerank-english-v3.0",
            rank_fields=rank_fields
        )

        doc_ids_reranked = [doc_ids[result.index] for result in rerank_results.results]

        docs_retrieved = []
        for doc_id in doc_ids_reranked:
            docs_retrieved.append(
                {
                    "title": self.docs[doc_id]["title"],
                    "text": self.docs[doc_id]["text"],
                    "url": self.docs[doc_id]["url"],
                }
            )

        return docs_retrieved

In the code cell below, we initialize an instance of the `Vectorstore` class and pass in the `raw_documents` list as input.

In [6]:
# Create an instance of the Vectorstore class with the given sources
vectorstore = Vectorstore(raw_documents)

Loading documents...
Embedding document chunks...
Indexing document chunks...
Indexing complete with 44 document chunks.


The `Vectorstore` class also has a `retrieve()` method, which we'll use to retrieve relevant document chunks given a query (as in Step 3 in the diagram shared at the beginning of this notebook).  This method has two components: (1) dense retrieval, and (2) reranking.

### Dense retrieval

First, we embed the query using the same `embed-english-v3.0` model we used to embed the document chunks, but this time we set `input_type="search_query"`.

Search is performed by the `knn_query()` method from the `hnswlib` library. Given a query, it returns the document chunks most similar to the query. We can define the number of document chunks to return using the attribute `self.retrieve_top_k=10`.
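To make the search step concrete, here is an exact, brute-force version of what the index approximates. The index in this notebook is created with `space="ip"`, so candidates are ranked by inner product with the query embedding. This is a sketch under stated assumptions: `inner_product` and `top_k` are hypothetical helpers, and the two-dimensional vectors are toy values, not real embeddings.

```python
from typing import List, Sequence


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner (dot) product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k(query_emb: Sequence[float], doc_embs: List[Sequence[float]], k: int) -> List[int]:
    """Return the ids of the k documents with the highest inner product to the query."""
    scored = [(inner_product(query_emb, emb), doc_id) for doc_id, emb in enumerate(doc_embs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Toy 2-dimensional "embeddings" for three documents
doc_embs = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
query_emb = [0.8, 0.6]

print(top_k(query_emb, doc_embs, k=2))  # [1, 0]
```

A brute-force scan like this compares the query against every document, which is why an approximate index such as HNSW is preferred once the collection grows.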

### Reranking

After semantic search, we implement a reranking step.  While our semantic search component is already highly capable of retrieving relevant sources, the [Rerank endpoint](https://cohere.com/rerank) provides an additional boost to the quality of the search results, especially for complex and domain-specific queries. It takes the search results and sorts them according to their relevance to the query.

We call the Rerank endpoint with the `co.rerank()` method and define the number of top reranked document chunks to retrieve using the attribute `self.rerank_top_k=3`.  The model we use is `rerank-english-v3.0`.  

The `retrieve()` method returns the top reranked document chunks, `docs_retrieved`, so that they can be passed to the chatbot.
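One detail of `retrieve()` is easy to miss: each rerank result carries an index into the list of candidates that was sent for reranking, not into the full collection, which is why the code maps it back with `doc_ids[result.index]`. The sketch below illustrates only that mapping. It assumes a toy word-overlap scorer, `toy_relevance`, as a stand-in for the Rerank endpoint; the helper names and the sample chunks are hypothetical.

```python
from typing import Dict, List


def toy_relevance(query: str, doc: Dict[str, str]) -> int:
    """Stand-in scorer (NOT the Rerank model): counts words shared with the query."""
    query_words = set(query.lower().split())
    doc_words = (doc["title"] + " " + doc["text"]).lower().split()
    return sum(1 for word in doc_words if word in query_words)


def rerank_ids(query: str, doc_ids: List[int], docs: List[Dict[str, str]], top_n: int) -> List[int]:
    """Reorder candidate ids by relevance and keep the top_n."""
    candidates = [docs[doc_id] for doc_id in doc_ids]
    # `positions` index into `candidates`, just as `result.index` does
    positions = sorted(
        range(len(candidates)),
        key=lambda pos: toy_relevance(query, candidates[pos]),
        reverse=True,
    )[:top_n]
    # Map candidate positions back to ids in the full collection
    return [doc_ids[pos] for pos in positions]


# Toy chunks, for illustration only
docs = [
    {"title": "Preambles", "text": "A preamble sets the tone of the model"},
    {"title": "Prompt Truncation", "text": "Long prompts may be truncated"},
    {"title": "Few-shot Prompting", "text": "Give the model a few examples first"},
    {"title": "Crafting Prompts", "text": "Show examples of the output"},
]
doc_ids = [3, 0, 2]  # ids as they might come back from dense retrieval

print(rerank_ids("few-shot examples", doc_ids, docs, top_n=2))  # [2, 3]
```

Using the candidate position directly as a document id would return the wrong chunks whenever dense retrieval returns ids in a different order than they are stored.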

In the code cell below, we check the document chunks that are retrieved for the query `"Prompting by giving examples"`.

## Test Retrieval

In [7]:
vectorstore.retrieve("Prompting by giving examples")

[{'title': 'Advanced Prompt Engineering Techniques',
  'text': 'Few-shot Prompting\n\nUnlike the zero-shot examples above, few-shot prompting is a technique that provides a model with examples of the task being performed before asking the specific question to be answered. We can steer the LLM toward a high-quality solution by providing a few relevant and diverse examples in the prompt. Good examples condition the model to the expected response type and style.',
  'url': 'https://docs.cohere.com/docs/advanced-prompt-engineering-techniques'},
 {'title': 'Crafting Effective Prompts',
  'text': 'Incorporating Example Outputs\n\nLLMs respond well when they have specific examples to work from. For example, instead of asking for the salient points of the text and using bullet points “where appropriate”, give an example of what the output should look like.',
  'url': 'https://docs.cohere.com/docs/crafting-effective-prompts'},
 {'title': 'Advanced Prompt Engineering Techniques',
  'text': 'In a

# Run chatbot

![RAG components - Chatbot](../images/llmu/rag/rag-components-chatbot.png)

We can now run the chatbot. For this, we create a `run_chatbot` function that includes the RAG components:
- For each user message, we use the endpoint’s search query generation feature to turn the message into one or more queries that are optimized for retrieval. The endpoint can also return no query, which means that the user message can be answered directly without retrieval. This is done by calling the Chat endpoint with the `search_queries_only` parameter set to `True`.
- If no search query is generated, we call the Chat endpoint to generate a response directly. If there is at least one, we call the `retrieve()` method of the `Vectorstore` instance to retrieve the most relevant document chunks for each query.
- Finally, the results from all queries are appended to a list and passed to the Chat endpoint for response generation.
- We print the response, together with the citations and the list of document chunks cited, for easy reference.
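Because each query is retrieved independently, the same chunk can appear more than once in the combined list. One optional refinement, which this notebook does not apply, is to drop duplicates before calling the Chat endpoint. The helper below is a hypothetical sketch that assumes a chunk is identified by its `url` and `text` fields.

```python
from typing import Dict, List


def dedupe_documents(documents: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Drop repeated chunks while keeping the first occurrence and the original order."""
    seen = set()
    unique = []
    for doc in documents:
        key = (doc["url"], doc["text"])  # assumed identity of a chunk
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


# Placeholder chunks: the first one is returned by two different queries
chunk_a = {"title": "A", "text": "alpha", "url": "https://example.com/a"}
chunk_b = {"title": "B", "text": "beta", "url": "https://example.com/b"}
combined = [chunk_a, chunk_b, dict(chunk_a)]

print(len(dedupe_documents(combined)))  # 2
```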

In [15]:
def run_chatbot(message, chat_history=None):
    if chat_history is None:
        chat_history = []
    
    # Generate search queries, if any        
    response = co.chat(message=message,
                        model="command-r-plus",
                        search_queries_only=True,
                        chat_history=chat_history)
    
    search_queries = []
    for query in response.search_queries:
        search_queries.append(query.text)

    # If there are search queries, retrieve the documents
    if search_queries:
        print("Retrieving information...", end="")

        # Retrieve document chunks for each query
        documents = []
        for query in search_queries:
            documents.extend(vectorstore.retrieve(query))

        # Use document chunks to respond
        response = co.chat_stream(
            message=message,
            model="command-r-plus",
            documents=documents,
            chat_history=chat_history,
        )

    else:
        response = co.chat_stream(
            message=message,
            model="command-r-plus",
            chat_history=chat_history,
        )
        
    # Print the chatbot response, citations, and documents
    chatbot_response = ""
    print("\nChatbot:")

    for event in response:
        if event.event_type == "text-generation":
            print(event.text, end="")
            chatbot_response += event.text
        if event.event_type == "stream-end":
            if event.response.citations:
                print("\n\nCITATIONS:")
                for citation in event.response.citations:
                    print(citation)
            if event.response.documents:
                print("\nCITED DOCUMENTS:")
                for document in event.response.documents:
                    print(document)
            # Update the chat history for the next turn
            chat_history = event.response.chat_history

    return chat_history

Here is a sample conversation consisting of a few turns. 

In [16]:
# Turn # 1
chat_history = run_chatbot("Hello, I have a question")


Chatbot:
Of course! I am here to help. Please go ahead with your question, and I will do my best to assist you.

In [17]:
# Turn # 2
chat_history = run_chatbot("What's the difference between zero-shot and few-shot prompting", chat_history)

Retrieving information...
Chatbot:
Zero-shot prompting involves asking the model to perform a task without providing any examples. On the other hand, few-shot prompting is a technique where the model is provided with a few relevant and diverse examples of the task being performed before asking the specific question to be answered. These examples help steer the model toward a high-quality solution and condition it to the expected response type and style.

CITATIONS:
start=0 end=19 text='Zero-shot prompting' document_ids=['doc_0']
start=29 end=95 text='asking the model to perform a task without providing any examples.' document_ids=['doc_0']
start=115 end=133 text='few-shot prompting' document_ids=['doc_0']
start=159 end=217 text='model is provided with a few relevant and diverse examples' document_ids=['doc_0']
start=246 end=297 text='before asking the specific question to be answered.' document_ids=['doc_0']
start=318 end=364 text='steer the model toward a high-quality solution' docume

In [18]:
# Turn # 3
chat_history = run_chatbot("How would the latter help?", chat_history)

Retrieving information...
Chatbot:
Few-shot prompting can vastly improve the quality of the model's completions. Providing a few relevant and diverse examples helps steer the model toward a high-quality solution by conditioning it to the expected response type and style.

CITATIONS:
start=23 end=77 text="vastly improve the quality of the model's completions." document_ids=['doc_2']
start=90 end=123 text='few relevant and diverse examples' document_ids=['doc_0']
start=130 end=176 text='steer the model toward a high-quality solution' document_ids=['doc_0']
start=180 end=236 text='conditioning it to the expected response type and style.' document_ids=['doc_0']

CITED DOCUMENTS:
{'id': 'doc_2', 'text': 'Advanced Prompt Engineering Techniques\n\nSuggest Edits\n\nThe previous chapter discussed general rules and heuristics to follow for successfully prompting the Command family of models. Here, we will discuss specific advanced prompt engineering techniques that can in many cases vastly impro

In [19]:
# Turn # 4
chat_history = run_chatbot("What do you know about 5G networks?", chat_history)

Retrieving information...
Chatbot:
Sorry, I don't have any information about 5G networks. Can I help you with anything else?

There are a few observations worth pointing out:

- Direct response: For user messages that don’t require retrieval (“Hello, I have a question”), the chatbot responds directly, without a retrieval step.
- Citation generation: For responses that do require retrieval ("What's the difference between zero-shot and few-shot prompting"), the endpoint returns the response together with the citations. These are fine-grained citations, which means they refer to specific spans of the generated text.
- State management: The endpoint keeps track of the conversation via the `chat_history` parameter, which is how it can correctly respond to a vague user message such as "How would the latter help?"
- Response synthesis: The model can decide if none of the retrieved documents provide the necessary information to answer a user message. For example, when asked the question, “What do you know about 5G networks”, the chatbot retrieves external information from the index. However, it doesn’t use any of the information in its response as none of it is relevant to the question.
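Since each citation is a span over the response text, it can be rendered inline with a few lines of code. The sketch below is a minimal illustration: the `Citation` tuple is a hypothetical stand-in whose fields mirror the `start`, `end`, and `document_ids` values printed above (it is not the SDK's own type), and the bracketed marker format is an arbitrary choice.

```python
from typing import List, NamedTuple


class Citation(NamedTuple):
    """Hypothetical stand-in mirroring the fields printed in the citations above."""
    start: int
    end: int
    document_ids: List[str]


def annotate(text: str, citations: List[Citation]) -> str:
    """Insert a [doc_id] marker right after each cited span."""
    annotated = text
    # Work from the end of the text backwards so earlier offsets stay valid
    for citation in sorted(citations, key=lambda c: c.end, reverse=True):
        marker = "[" + ", ".join(citation.document_ids) + "]"
        annotated = annotated[: citation.end] + marker + annotated[citation.end :]
    return annotated


# First sentence of the Turn 2 response, with its first two citations
text = "Zero-shot prompting involves asking the model to perform a task without providing any examples."
citations = [
    Citation(start=0, end=19, document_ids=["doc_0"]),
    Citation(start=29, end=95, document_ids=["doc_0"]),
]

print(annotate(text, citations))
```

Inserting markers from the last span to the first avoids having to shift the offsets of the remaining citations after each insertion.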

Here are the contents of the chat history.

In [20]:
print("Chat history:")
for c in chat_history:
    print(c, "\n")
print("="*50)

Chat history:
message='Hello, I have a question' tool_calls=None role='USER' 

message='Of course! I am here to help. Please go ahead with your question, and I will do my best to assist you.' tool_calls=None role='CHATBOT' 

message="What's the difference between zero-shot and few-shot prompting" tool_calls=None role='USER' 

message='Zero-shot prompting involves asking the model to perform a task without providing any examples. On the other hand, few-shot prompting is a technique where the model is provided with a few relevant and diverse examples of the task being performed before asking the specific question to be answered. These examples help steer the model toward a high-quality solution and condition it to the expected response type and style.' tool_calls=None role='CHATBOT' 

message='How would the latter help?' tool_calls=None role='USER' 

message="Few-shot prompting can vastly improve the quality of the model's completions. Providing a few relevant and diverse examples helps 