<a href="https://colab.research.google.com/github/aljebraschool/ai-startup-idea-generator/blob/master/LLM_university_RAG_with_Chat%2C_Embed%2C_and_Rerank.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setup
First, let’s install and import the necessary libraries for this project: cohere for the Embed, Rerank, and Chat endpoints, hnswlib as the vector library, and unstructured for loading and chunking the documents.

In [None]:
!pip install cohere unstructured hnswlib -q

In [None]:
import cohere
import uuid
import hnswlib
from unstructured.partition.html import partition_html
from unstructured.chunking.title import chunk_by_title

In [None]:
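# The calls below (message=, chat_history=, search_queries_only=, rank_fields=) follow the v1 client interface
co = cohere.Client("COHERE_API_KEY") # Get your free API key: https://dashboard.cohere.com/api-keys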

# Create the Vectorstore Component

The Vectorstore class handles the ingestion of documents into embeddings (or vectors) and the retrieval of relevant documents given a query.

As an example, we’ll use the contents from Cohere's documentation on prompt engineering. It consists of four web pages, each listed in the Python list raw_documents below. Each entry is identified by its title and URL.

In [None]:
raw_documents = [
    {"title": "Crafting Effective Prompts",
     "url": "https://docs.cohere.com/docs/crafting-effective-prompts"},
    {"title": "Advanced Prompt Engineering Techniques",
     "url": "https://docs.cohere.com/docs/advanced-prompt-engineering-techniques"},
    {"title": "Prompt Truncation",
     "url": "https://docs.cohere.com/docs/prompt-truncation"},
    {"title": "Preambles",
     "url": "https://docs.cohere.com/docs/preambles"}

]

We implement this in the Vectorstore class below, which takes the raw_documents list as input.

We also initialize a few instance attributes and methods. The attributes include self.raw_documents to represent the raw documents, self.docs to represent the chunked version of the documents, self.docs_embs to represent the embeddings of the chunked documents, and a couple of top_k parameters to be used for retrieval and reranking.

Meanwhile, the methods include load_and_chunk(), embed(), and index() for ingesting raw documents. As you’ll see, we will also specify a retrieve() method to retrieve relevant document chunks given a query.

# Load and Chunk the Documents

The load_and_chunk() method loads the raw documents from their URLs and breaks them into smaller chunks. Chunking for information retrieval is a broad topic in and of itself, with many strategies being discussed within the AI community. For our example, we’ll use the partition_html function from the unstructured library to parse each page into elements, and chunk_by_title to group those elements into chunks by section.

Each chunk is turned into a dictionary with three fields:

- title: The web page’s title
- text: The textual content of the chunk
- url: The web page’s URL

This information will eventually be passed to the chatbot’s prompt for generating the response, so it’s crucial to populate relevant information into this dictionary. Note that we are not limited to these three fields. At a minimum, the Chat endpoint requires the text field, but beyond that, we can add custom fields that can provide more context about the document, such as subtitles, snippets, tags, and others.

The resulting dictionaries are stored in the self.docs attribute.
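
To make the chunk structure concrete, here is a minimal sketch of title-based chunking in plain Python. It is a simplified illustration of the idea, not the implementation used by unstructured: the function chunk_by_heading, the (category, text) element format, and the example page are all assumptions made for this sketch.

```python
def chunk_by_heading(elements, page_title, page_url):
    """Group (category, text) pairs into chunks that each start at a heading."""
    chunks, current = [], []
    for category, text in elements:
        # A heading closes the chunk in progress and opens a new one
        if category == "Title" and current:
            chunks.append(current)
            current = []
        current.append(text)
    if current:
        chunks.append(current)

    # Each chunk becomes a dictionary with the three fields described above
    return [
        {"title": page_title, "text": "\n\n".join(parts), "url": page_url}
        for parts in chunks
    ]


# Hypothetical parsed page: two sections, each a heading plus body text
elements = [
    ("Title", "Few-shot Prompting"),
    ("NarrativeText", "Provide examples before asking the question."),
    ("Title", "Chain of Thought"),
    ("NarrativeText", "Ask the model to reason step by step."),
]
docs = chunk_by_heading(elements, "Example Page", "https://example.com/page")
for doc in docs:
    print(doc)
```

Every chunk carries the page-level title and url alongside its own text, so the source of each chunk stays available when it is later passed to the chatbot.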

# Embed the Document Chunks

The embed() method generates embeddings of the chunked documents. We use the Embed endpoint and Cohere's embed-english-v3.0 model. Since the endpoint has a limit of 96 documents per call, we send them in batches of 90.

With the Embed v3 model, we need to define an input_type, of which there are four options depending on the type of task. Using these input types ensures the highest possible quality for the respective tasks. Since our document chunks will be used for retrieval, we use search_document as the input_type.

The resulting chunk embeddings are stored in the self.docs_embs attribute.
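
The batching step on its own can be sketched as follows. This is a minimal illustration that makes no API calls; the helper name batched and the stand-in texts are assumptions for this example.

```python
def batched(items, batch_size=96):
    """Yield consecutive slices of `items` with at most `batch_size` entries."""
    for start in range(0, len(items), batch_size):
        # Slicing past the end of a list is safe, so the last batch may be shorter
        yield items[start:start + batch_size]


texts = [f"chunk {i}" for i in range(200)]  # stand-in for the chunk texts
batch_sizes = [len(batch) for batch in batched(texts)]
print(batch_sizes)
```

Each batch would then be sent to the Embed endpoint in a separate call, and the returned embeddings appended to one list in the same order as the chunks.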

# Index Document Chunks

The index() method indexes the document chunk embeddings. We build an index to store the embeddings in a structured and organized way in order to ensure efficient similarity search during retrieval.

There are many options available for building an index. For production environments, typically a vector database (like Weaviate or MongoDB) is required to handle the continuous process of indexing documents and maintaining the index.

In our example, however, we’ll keep it simple and use a vector library instead. We can choose from many open-source projects, such as Faiss, Annoy, ScaNN, or Hnswlib, which is the one we’ll use. These libraries store embeddings in in-memory indexes and implement approximate nearest neighbor (ANN) algorithms to make similarity search efficient.

The resulting index, which holds the document chunk embeddings, is stored in the self.idx attribute.
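
To see what the index is approximating, here is a minimal sketch of exact (brute-force) nearest neighbor search by inner product, written in plain Python. It is not how hnswlib works internally; an ANN index returns similar results without scoring every vector. The function name and the toy two-dimensional vectors are assumptions for this example.

```python
def top_k_inner_product(query, vectors, k):
    """Return the indices of the k vectors with the largest dot product with query."""
    scores = [sum(q * v for q, v in zip(query, vec)) for vec in vectors]
    ranked = sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]  # toy "document" embeddings
query = [0.9, 0.1]                              # toy "query" embedding
nearest = top_k_inner_product(query, vectors, k=2)
print(nearest)
```

Brute-force search scores every stored vector for each query, which becomes slow as the collection grows. That cost is the reason for building an index.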

# Implement Retrieval

The retrieve() method uses semantic search to retrieve relevant document chunks given a query, and it has two steps: (1) dense retrieval, (2) reranking.

# Dense Retrieval

We implement a dense retrieval system that leverages embeddings to retrieve document chunks, offering significant improvements over basic keyword-matching approaches. Embeddings can capture the contextual meaning of a document, thus enabling the retrieval of highly relevant results to the given query.

We embed the query using the same embed-english-v3.0 model that we used to embed the document chunks, but this time, we set input_type="search_query".

Search is performed by the knn_query() method from the hnswlib library. Given a query, it returns the document chunks most similar to the query. We define the number of document chunks to return using the attribute self.retrieve_top_k=10.

# Reranking
After dense retrieval, we implement a reranking step. While our dense retrieval component is already highly capable of retrieving relevant sources, Cohere Rerank provides an additional boost to the quality of the search results, especially for complex and domain-specific queries. It takes the search results and sorts them according to their relevance to the query.

We call the Rerank endpoint with co.rerank() and pass the query and the list of document chunks to be reranked. We also define the number of top reranked document chunks to retrieve using the attribute self.rerank_top_k=3. The model we use is rerank-english-v3.0, which lets you rerank documents that contain multiple fields, in the form of JSON objects. In our case, we'll use the title and text fields for reranking.

This method returns the top retrieved document chunks as a Python list docs_retrieved, so that they can be passed to the chatbot, which we’ll implement next.
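
One detail of the two-step retrieval is easy to miss: each rerank result identifies a document by its position in the list that was sent for reranking, so that position has to be translated back to the chunk ID from dense retrieval. The sketch below illustrates this with stand-in data; the RerankResult tuple and all values are hypothetical and no API call is made.

```python
from collections import namedtuple

# Hypothetical stand-in for one rerank result: `index` is the position of the
# document in the list that was sent for reranking.
RerankResult = namedtuple("RerankResult", ["index", "relevance_score"])

doc_ids = [42, 7, 19, 3]  # chunk IDs from dense retrieval, in retrieval order
rerank_results = [RerankResult(2, 0.98), RerankResult(0, 0.91)]  # top_n = 2

# Translate positions in the reranked list back to chunk IDs
doc_ids_reranked = [doc_ids[r.index] for r in rerank_results]
print(doc_ids_reranked)
```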

In [None]:
class Vectorstore:
    """
    A class representing a collection of documents indexed into a vectorstore.

    Parameters:
    raw_documents (list): A list of dictionaries representing the sources of the raw documents. Each dictionary should have 'title' and 'url' keys.

    Attributes:
    raw_documents (list): A list of dictionaries representing the raw documents.
    docs (list): A list of dictionaries representing the chunked documents, with 'title', 'text', and 'url' keys.
    docs_embs (list): A list of the associated embeddings for the document chunks.
    docs_len (int): The number of document chunks in the collection.
    idx (hnswlib.Index): The index used for document retrieval.

    Methods:
    load_and_chunk(): Loads the data from the sources and partitions the HTML content into chunks.
    embed(): Embeds the document chunks using the Cohere API.
    index(): Indexes the document chunks for efficient retrieval.
    retrieve(): Retrieves document chunks based on the given query.
    """

    def __init__(self, raw_documents: list[dict[str, str]]):
        self.raw_documents = raw_documents
        self.docs = []
        self.docs_embs = []
        self.retrieve_top_k = 10
        self.rerank_top_k = 3
        # Ingest the documents as soon as the instance is created
        self.load_and_chunk()
        self.embed()
        self.index()

    def load_and_chunk(self) -> None:
        """
        Loads the text from the sources and chunks the HTML content.
        """
        print("Loading Documents...")

        for raw_document in self.raw_documents:
            # Fetch and parse the page, then group its elements by section title
            elements = partition_html(url=raw_document["url"])
            chunks = chunk_by_title(elements)

            for chunk in chunks:
                self.docs.append(
                    {
                        "title": raw_document["title"],
                        "text": str(chunk),
                        "url": raw_document["url"],
                    }
                )

    def embed(self) -> None:
        """
        Embeds the document chunks using the Cohere API.
        """
        print("Embedding documents chunks...")

        batch_size = 90  # stays under the Embed endpoint's limit of 96 per call
        self.docs_len = len(self.docs)
        for i in range(0, self.docs_len, batch_size):
            # Slicing past the end is safe, so the last batch may be shorter
            batch = self.docs[i : i + batch_size]
            texts = [item["text"] for item in batch]

            docs_embs_batch = co.embed(
                texts=texts,
                model="embed-english-v3.0",
                input_type="search_document",
            ).embeddings
            self.docs_embs.extend(docs_embs_batch)

    def index(self) -> None:
        """
        Indexes the documents for efficient retrieval.
        """
        print("Indexing documents...")

        # embed-english-v3.0 returns 1024-dimensional vectors; 'ip' is inner product
        self.idx = hnswlib.Index(space="ip", dim=1024)
        self.idx.init_index(max_elements=self.docs_len, ef_construction=512, M=64)
        self.idx.add_items(self.docs_embs, list(range(len(self.docs_embs))))

        print(f"indexing complete with {self.idx.get_current_count()} documents")

    def retrieve(self, query: str) -> list[dict[str, str]]:
        """
        Retrieves document chunks based on the given query.

        Parameters:
        query (str): The query to retrieve document chunks for.

        Returns:
        List[Dict[str, str]]: A list of dictionaries representing the retrieved document chunks, with 'title', 'text', and 'url' keys.
        """

        # dense retrieval
        query_embedding = co.embed(
            texts = [query],
            model="embed-english-v3.0",
            input_type = 'search_query'
        ).embeddings

        doc_ids = self.idx.knn_query(query_embedding, k = self.retrieve_top_k)[0][0]

        #Reranking
        rank_fields = ["title", "text"] # We'll use the title and text fields for reranking

        doc_to_rerank = [self.docs[doc_id] for doc_id in doc_ids]

        rerank_result = co.rerank(
            query = query,
            documents = doc_to_rerank,
            top_n = self.rerank_top_k,
            model = 'rerank-english-v3.0',
            rank_fields = rank_fields
        )

        docs_ids_reranked = [doc_ids[result.index]  for result in rerank_result.results]

        docs_retrieved = []

        for doc_id in docs_ids_reranked:
            docs_retrieved.append(
                {
                    "title": self.docs[doc_id]["title"],
                    "text": self.docs[doc_id]["text"],
                    "url": self.docs[doc_id]["url"],
                }
            )

        return docs_retrieved



# Process the Documents

We can now process the raw documents. We do that by creating an instance of Vectorstore. In the run shown here, this yields 105 document chunks from the four web URLs; the exact count depends on the content of the pages at the time they are loaded.

In [None]:
vectorstore = Vectorstore(raw_documents)

Loading Documents...
Embedding documents chunks...
Indexing documents...
indexing complete with 105 documents


# Test Retrieval

Before going further, we first test the document retrieval part of the system. Using the Vectorstore instance created above, we call the retrieve method to retrieve the document chunks most relevant to the query "Prompting by giving examples".

In [None]:
vectorstore.retrieve("Prompting by giving examples")

[{'title': 'Advanced Prompt Engineering Techniques',
  'text': 'Few-shot Prompting\n\nUnlike the zero-shot examples above, few-shot prompting is a technique that provides a model with examples of the task being performed before asking the specific question to be answered. We can steer the LLM toward a high-quality solution by providing a few relevant and diverse examples in the prompt. Good examples condition the model to the expected response type and style.',
  'url': 'https://docs.cohere.com/docs/advanced-prompt-engineering-techniques'},
 {'title': 'Crafting Effective Prompts',
  'text': 'Incorporating Example Outputs\n\nLLMs respond well when they have specific examples to work from. For example, instead of asking for the salient points of the text and using bullet points “where appropriate”, give an example of what the output should look like.',
  'url': 'https://docs.cohere.com/docs/crafting-effective-prompts'},
 {'title': 'Advanced Prompt Engineering Techniques',
  'text': 'In a

We can now run the chatbot. For this, we create a run_chatbot function which includes the RAG components:
- For each user message, we use the endpoint’s search query generation feature to turn the message into one or more queries that are optimized for retrieval. The endpoint can even return no query, which means that a user message can be responded to directly without retrieval. This is done by calling the Chat endpoint with the search_queries_only parameter and setting it as True.
- If there is no search query generated, we call the Chat endpoint to generate a response directly. If there is at least one, we call the retrieve method from the Vectorstore instance to retrieve the most relevant documents to each query.
- Finally, all the results from all queries are appended to a list and passed to the Chat endpoint for response generation.
- We print the response, together with the citations and the list of document chunks cited, for easy reference.

In [None]:
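def run_chatbot(message, chat_history=None):
  # Use a fresh list per call instead of a shared mutable default
  if chat_history is None:
    chat_history = []

  # Generate search queries, if any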
  response = co.chat(message = message,
                     model="command-r-plus",
                     search_queries_only=True,
                     chat_history=chat_history
                     )

  search_queries = []
  for query in response.search_queries:
    search_queries.append(query.text)

  # If there are search queries, retrieve the documents
  if search_queries:
    print("Retrieving Information...",  end = '')

    # Retrieve document chunks for each query
    docs_retrieved = []
    for query in search_queries:
      docs_retrieved.extend(vectorstore.retrieve(query))

    # Use document chunks to respond
    response = co.chat_stream(
        message = message,
        model="command-r-plus",
        documents = docs_retrieved,
        chat_history=chat_history
    )

  else:
    response = co.chat_stream(
        message = message,
        model="command-r-plus",
        chat_history=chat_history
    )

  # Print the chatbot response and citations
  chatbot_response = ""
  print("\nChatbot:")

  for event in response:
    if event.event_type == "text-generation":
      print(event.text, end = "")
      chatbot_response += event.text

    if event.event_type == "stream-end":
      if event.response.citations:
        print("\n\nCITATIONS:")
        for citation in event.response.citations:
          print(citation)

      if event.response.documents:
        print("\n\nDOCUMENTS:")
        for document in event.response.documents:
          print(document)

      # Update the chat history for the next turn
      chat_history = event.response.chat_history

  return chat_history


# Search Query Generation
Let's take a deeper look at the search query generation feature. Based on the user message, the chatbot needs to decide if it needs to consult external information before responding. If so, the chatbot determines an optimal set of search queries to use for retrieval. When we call co.chat() with search_queries_only=True, the Chat endpoint handles this for us automatically.

The generated queries can be accessed from the search_queries field of the object that is returned. To understand how this works, let’s look at a few scenarios:

- No query needed: Suppose we have a user message of "Hello, I need help with a report I'm writing". This type of message doesn’t require any additional context from external information, so retrieval is not required. A direct chatbot response will suffice (for example: "Sure, how can I help?"). When we send this to the Chat endpoint, we get an empty search_queries result, which is what we expect.
- One query generated: Take this user message: "What did the report say about the company's Q4 performance?" This does require additional context as it refers to a report, hence retrieval is required. Given this message, the Chat endpoint returns the search_queries result of Q4 company performance. Here it turns the user message into a query optimized for search.
- Query generated from conversation context: Another important scenario is generating queries in the context of the conversation. Suppose there’s an ongoing conversation where the user is learning from the chatbot about deep learning. If at some point the user asks, "Why is it important", then the generated search_queries will become why is deep learning important, providing the much-needed context for the retrieval process.
- More than one query generated: What if the user message is a bit more complex, such as "What did the report say about the company's Q4 performance and its range of products and services?" This requires multiple pieces of information to be retrieved. Given this message, the Chat endpoint returns two search_queries results: Q4 company performance and company's range of products and services.

These scenarios highlight the adaptability of the Chat endpoint to decide on the next course of action based on a user message.

# Document Retrieval
Let's take a deeper look at the document retrieval step. What happens next depends on how many search queries are returned.

If search queries are returned

If the chatbot response contains at least one search query, we call the retrieve() method from the Vectorstore class instance to retrieve document chunks that are relevant to the queries.

Then, we call the Chat endpoint to generate a response, adding a documents parameter to the call to pass the relevant document chunks.
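
Because the results of all queries are appended to a single list, the same chunk can appear more than once when two queries retrieve it. The notebook passes the list as is. As an optional refinement that is not part of the original code, the sketch below drops repeated chunks while preserving order; the helper name dedupe_chunks and the sample data are assumptions for this example.

```python
def dedupe_chunks(chunks):
    """Drop repeated chunks, keeping the first occurrence of each (url, text) pair."""
    seen = set()
    unique = []
    for chunk in chunks:
        key = (chunk["url"], chunk["text"])
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique


# Stand-in for chunks retrieved by two queries that overlap on one chunk
retrieved = [
    {"title": "A", "text": "few-shot", "url": "https://example.com/a"},
    {"title": "B", "text": "zero-shot", "url": "https://example.com/b"},
    {"title": "A", "text": "few-shot", "url": "https://example.com/a"},
]
unique_chunks = dedupe_chunks(retrieved)
print([chunk["title"] for chunk in unique_chunks])
```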

If no search queries are returned

Meanwhile, if the chatbot response doesn’t contain any search queries, then no information retrieval is needed. To generate the response, we call the Chat endpoint again, passing only the user message and no documents.

In either case, we also pass the chat_history parameter, which retains the interactions between the user and the chatbot in the same conversation thread. We also use the chat_stream endpoint so we can stream the chatbot response to the application.

# Response and Citation Generation
Let's take a deeper look at the response generation step. The chatbot response is a stream of events, such as the generated text and citations, followed by a final object that contains the sources used by the chatbot along with other details.

To display the response, we use the text-generation events from the response stream.

The citation-generation events indicate the spans of text from the retrieved document chunks on which the response is grounded. Here is one example:



> start=382 end=397 text='similar vectors' document_ids=['doc_0', 'doc_2']

The format of each citation is:

- start: The starting point of a span where one or more documents are referenced
- end: The ending point of a span where one or more documents are referenced
- text: The text representing this span
- document_ids: The IDs of the document chunks being referenced (doc_0 being the ID of the first document chunk passed to the documents parameter in the endpoint call, and so on)

The final response object includes a list of the document chunks, which we access from the documents attribute.
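
As an illustration of how these fields can be used, the sketch below inserts a source marker after each cited span of a response. It is a minimal example under stated assumptions: the Citation dataclass is a stand-in that mirrors the four fields above (it is not the SDK's class), and annotate is a hypothetical helper. The sample text and offsets are taken from the example conversation in this notebook.

```python
from dataclasses import dataclass


@dataclass
class Citation:
    """Stand-in mirroring the citation fields described above."""
    start: int
    end: int
    text: str
    document_ids: list[str]


def annotate(response_text, citations):
    """Insert a [doc_id, ...] marker after each cited span."""
    annotated = response_text
    # Work from the last span backwards so earlier offsets stay valid
    for c in sorted(citations, key=lambda c: c.end, reverse=True):
        marker = " [" + ", ".join(c.document_ids) + "]"
        annotated = annotated[:c.end] + marker + annotated[c.end:]
    return annotated


response_text = (
    "Few-shot prompting can vastly improve the quality of a model's completions."
)
citations = [
    Citation(23, 75, "vastly improve the quality of a model's completions.", ["doc_1"]),
]
annotated = annotate(response_text, citations)
print(annotated)
```

The start and end values are character offsets into the generated response, so slicing the response with them recovers the text field of the citation.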



# Example conversation
Here’s an example of a conversation that happens over a few turns:

In [None]:
# Turn # 1

chat_history = run_chatbot("Hello, I have a question!")


Chatbot:
Of course! I am here to help. Please go ahead with your question and I will do my best to assist you.

In [None]:
# Turn # 2
chat_history = run_chatbot("what is the different between zero-shot and few-shot prompting?", chat_history)

Retrieving Information...
Chatbot:
Zero-shot prompting is a technique where a model is asked to perform a task without being provided with any examples. On the other hand, few-shot prompting involves providing a model with a few relevant examples of the task being performed before asking the specific question to be answered. This helps steer the model toward a high-quality solution.

CITATIONS:
start=0 end=19 text='Zero-shot prompting' document_ids=['doc_0', 'doc_3']
start=43 end=117 text='model is asked to perform a task without being provided with any examples.' document_ids=['doc_0', 'doc_3']
start=137 end=155 text='few-shot prompting' document_ids=['doc_0', 'doc_3']
start=165 end=239 text='providing a model with a few relevant examples of the task being performed' document_ids=['doc_0', 'doc_3']
start=258 end=291 text='specific question to be answered.' document_ids=['doc_0', 'doc_3']
start=303 end=350 text='steer the model toward a high-quality solution.' document_ids=['doc_0', 'd

In [None]:
# Turn # 3
chat_history = run_chatbot("how will the latter help?", chat_history)

Retrieving Information...
Chatbot:
Few-shot prompting can vastly improve the quality of a model's completions. Providing a few relevant and diverse examples in the prompt helps steer the model toward a high-quality solution by conditioning it to the expected response type and style.

CITATIONS:
start=23 end=75 text="vastly improve the quality of a model's completions." document_ids=['doc_1']
start=88 end=121 text='few relevant and diverse examples' document_ids=['doc_0']
start=142 end=188 text='steer the model toward a high-quality solution' document_ids=['doc_0']
start=192 end=248 text='conditioning it to the expected response type and style.' document_ids=['doc_0']


DOCUMENTS:
{'id': 'doc_1', 'text': 'Advanced Prompt Engineering Techniques\n\nThe previous chapter discussed general rules and heuristics to follow for successfully prompting the Command family of models. Here, we will discuss specific advanced prompt engineering techniques that can in many cases vastly improve the quali

In [None]:
# Turn # 4
chat_history = run_chatbot("What do you know about 5G network?", chat_history)

Retrieving Information...
Chatbot:
Sorry, I do not have access to any information about the 5G network. Can I help you with anything else?

There are a few observations worth pointing out:

- Direct response: For user messages that don’t require retrieval ("Hello, I have a question!"), the chatbot responds directly.
- Citation generation: For responses that do require retrieval ("what is the different between zero-shot and few-shot prompting?"), the endpoint returns the response together with the citations. These are fine-grained citations, which means they refer to specific spans of the generated text.
- State management: The endpoint maintains the state of the conversation via the chat_history parameter. This is what lets it respond correctly to a vague user message such as "how will the latter help?"
- Response synthesis: The model can decide that none of the retrieved documents provide the necessary information to answer a user message. For example, when asked "What do you know about 5G network?", the chatbot retrieves external information from the index. However, it doesn’t use any of that information in its response, as none of it is relevant to the question.

In [None]:
print("Chat history:")
for c in chat_history:
    print(c, "\n")

Chat history:
role='USER' message='Hello, I have a question!' tool_calls=None 

role='CHATBOT' message='Of course! I am here to help. Please go ahead with your question and I will do my best to assist you.' tool_calls=None 

role='USER' message='what is the different between zero-shot and few-shot prompting?' tool_calls=None 

role='CHATBOT' message='Zero-shot prompting is a technique where a model is asked to perform a task without being provided with any examples. On the other hand, few-shot prompting involves providing a model with a few relevant examples of the task being performed before asking the specific question to be answered. This helps steer the model toward a high-quality solution.' tool_calls=None 

role='USER' message='how will the latter help?' tool_calls=None 

role='CHATBOT' message="Few-shot prompting can vastly improve the quality of a model's completions. Providing a few relevant and diverse examples in the prompt helps steer the model toward a high-quality solutio