# Chatbot for understanding research papers better.

Motivation:
- Reading papers can be monotonous, and I have often understood a paper better in reading groups, where someone presents and others ask questions.
- Command-R has been specifically trained with grounded generation capabilities.
- It can generate responses based on a list of supplied document snippets, and will include grounding spans (citations) in its response indicating the source of the information.
- This makes it a strong fit for this use case.
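
To make the idea of grounding spans concrete, here is a minimal sketch of how citations could be rendered inline. It assumes each citation is a plain dict with `start`, `end`, and `document_ids` keys, mirroring the fields Cohere citations expose; the helper name `annotate_with_citations` and the sample data are made up for illustration.

```python
def annotate_with_citations(text: str, citations: list[dict]) -> str:
    """Append a [doc ids] marker after each cited span of the answer."""
    out = text
    # Work right to left so earlier offsets stay valid after each insertion.
    for citation in sorted(citations, key=lambda c: c["end"], reverse=True):
        marker = "[" + ", ".join(citation["document_ids"]) + "]"
        out = out[: citation["end"]] + " " + marker + out[citation["end"] :]
    return out


# Hypothetical answer and citations (not real model output).
answer = "The Transformer relies solely on attention."
citations = [
    {"start": 4, "end": 15, "document_ids": ["doc_0"]},
    {"start": 23, "end": 42, "document_ids": ["doc_0", "doc_2"]},
]
annotated = annotate_with_citations(answer, citations)
print(annotated)
```

Each marker points back to the snippets that support the span, which is what makes the answers easy to check against the paper.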

Technical Details:
- Uses RAG with the Command-R model to help understand papers better.
- Provides a QA interface where the user asks questions and the model answers with citations.


In this notebook we build a chatbot and a VectorDB. The chatbot leverages the VectorDB to answer a user's questions.

References:
- Cohere LLM University and Developer Guides.

## Imports

In [1]:
%%capture
# Install dependencies (the %%capture cell magic must be the first line of the cell)
!pip install cohere langchain pypdf hnswlib

In [2]:
# Use langchain for parsing pdfs
import cohere
import hnswlib
from langchain.document_loaders import PyPDFLoader
from langchain_core.documents.base import Document
import uuid
import urllib.request

co = cohere.Client("api_key")  # replace "api_key" with your Cohere API key

## Create Vector DB

In [3]:
class Vectorstore:
    def __init__(self, paper_url: str) -> None:
        """Initialize the Vectorstore.

        Populates the VectorStore with embeddings and documents.

        Args:
            paper_url (str): URL to the paper's pdf.

        Returns:
            None
        """
        self.docs = self.load_docs(paper_url) # can be extended for multiple docs (creating a paper db)
        self.docs_len = len(self.docs)
        self.docs_embs = []
        self.retrieve_top_k = 10 # can be tuned for optimal results
        self.rerank_top_k = 3 # can be tuned for optimal results
        self.embed_batch_size = 50 # can be optimized for api calls
        self.embed()
        self.index()

    def load_docs(self, paper_url: str) -> list[Document]:
        """ Utilizes the PyPDFLoader to load the paper (pdf) and split it into pages.

        Args:
            paper_url (str): URL to the paper's pdf.

        Returns:
            pages (list[Document]): List of pages represented as langchain docs.

        """
        local_path = "./paper.pdf"
        urllib.request.urlretrieve(paper_url, local_path)
        pdf_loader = PyPDFLoader(local_path)
        # split pages from pdf
        return pdf_loader.load_and_split()

    def embed(self) -> None:
        """Embeds the document chunks using the Cohere Embed API."""

        for i in range(0, self.docs_len, self.embed_batch_size):

            batch = self.docs[i : min(i + self.embed_batch_size, self.docs_len)]
            texts = [item.page_content for item in batch]

            self.docs_embs.extend(co.embed(texts=texts, model="embed-english-v3.0", input_type="search_document").embeddings)

        print("Embeddings Generated for documents.")

    def index(self) -> None:
        """Indexes the documents for efficient retrieval."""

        # NOTE: these params are hard-coded but can be tuned/updated.
        self.idx = hnswlib.Index(space="ip", dim=1024)
        self.idx.init_index(max_elements=self.docs_len, ef_construction=512, M=64)
        self.idx.add_items(self.docs_embs, list(range(len(self.docs_embs))))

        print(f"Indexing complete with {self.idx.get_current_count()} documents.")

    def retrieve(self, query: str) -> list[dict[str, str]]:
        """ Retrieves document chunks based on the given query.

        Args:
            query (str): The query to retrieve document chunks for.

        Returns:
            retrieved_docs (list[dict[str, str]]): List of dicts representing the retrieved document chunks, with 'text' and 'metadata'.
        """

        # Dense retrieval
        # NOTE: we can do this for other languages as well.
        query_emb = co.embed(
            texts=[query], model="embed-english-v3.0", input_type="search_query"
        ).embeddings

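        # hnswlib raises an error if k exceeds the number of indexed items, so cap it.
        k = min(self.retrieve_top_k, self.docs_len)
        doc_ids = self.idx.knn_query(query_emb, k=k)[0][0]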

        # Reranking
        docs_to_rerank = [self.docs[doc_id].page_content for doc_id in doc_ids]

        rerank_results = co.rerank(
            query=query,
            documents=docs_to_rerank,
            top_n=self.rerank_top_k,
            model="rerank-english-v2.0",
        ).results

        doc_ids_reranked = [doc_ids[result.index] for result in rerank_results]

        retrieved_docs = [
            {
                "text": self.docs[doc_id].page_content,
                "metadata": self.docs[doc_id].metadata,
            }
            for doc_id in doc_ids_reranked
        ]

        return retrieved_docs

In [4]:
vectorstore = Vectorstore("https://arxiv.org/pdf/1706.03762.pdf")

Embeddings Generated for documents.
Indexing complete with 16 documents.
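
The `retrieve` method above follows a two-stage pattern: a cheap vector lookup narrows the candidates, then a reranker picks the best few. The sketch below shows the same pattern without any API calls. It is a toy illustration, not what the Cohere models do: the brute-force inner product stands in for the HNSW index, the word-overlap score stands in for the rerank model, and the vectors and texts are invented.

```python
def top_k_inner_product(query_vec, doc_vecs, k):
    """Brute-force stand-in for the HNSW lookup: rank documents by inner product."""
    scores = [sum(q * d for q, d in zip(query_vec, vec)) for vec in doc_vecs]
    k = min(k, len(doc_vecs))  # never ask for more neighbours than exist
    ranked = sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def rerank(query, texts, candidate_ids, top_n):
    """Toy reranker: score candidates by words shared with the query."""
    query_words = set(query.lower().split())

    def overlap(doc_id):
        return len(query_words & set(texts[doc_id].lower().split()))

    return sorted(candidate_ids, key=overlap, reverse=True)[:top_n]


# Invented documents and 2-d "embeddings" for illustration only.
texts = [
    "attention is all you need",
    "recurrent networks process tokens in order",
    "multi-head attention runs heads in parallel",
]
doc_vecs = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.3]]
query = "how does multi-head attention work"
query_vec = [1.0, 0.0]

candidates = top_k_inner_product(query_vec, doc_vecs, k=2)  # stage 1: dense retrieval
best = rerank(query, texts, candidates, top_n=1)            # stage 2: reranking
print(candidates, best)
```

The first stage only needs to be good enough to keep the right chunk among the candidates; the second stage can then afford a more expensive comparison on a short list.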


## Create Chatbot

In [5]:
class PaperBot:
    def __init__(self, vectorstore: Vectorstore) -> None:
        """Initializes an instance of the PaperBot class.

        This acts as a chatbot that the user can interact with to discuss research papers (can be extended to any text-based PDFs).

        Parameters:
        vectorstore (Vectorstore): An instance of the Vectorstore class.
        """
        self.vectorstore = vectorstore
        self.conversation_id = str(uuid.uuid4()) # required to keep context within the cohere chat.

    def run(self) -> None:
        """ Runs the chatbot application.

        Opens up an interactive chat window for the user.
        Can exit with the input: quit.
        """
        while True:
            # Manage interactions
            message = input("USER: ")

            if message.lower() == "quit":
                print("Thanks for chatting, talk to you soon!")
                break


            # Generate search queries, if any
            response = co.chat(message=message, search_queries_only=True)

            # If there are search queries, retrieve document chunks and respond
            if response.search_queries:
                print("Retrieving information...", end="")

                # Retrieve document chunks for each query
                documents = []
                for query in response.search_queries:
                    documents.extend(self.vectorstore.retrieve(query.text))

                document_contents = [{"text": doc["text"]} for doc in documents]

                # Use document chunks to respond
                response = co.chat_stream(
                    message=message,
                    model="command-r",
                    documents=document_contents,
                    conversation_id=self.conversation_id,
                )
            else:
                response = co.chat_stream(
                    message=message,
                    model="command-r",
                    conversation_id=self.conversation_id,
                )

            # Build chatbot output
            print("\nPAPER_BOT:")
            citations = []
            cited_documents = []

            # Display response
            for event in response:
                if event.event_type == "text-generation":
                    print(event.text, end="")
                elif event.event_type == "citation-generation":
                    citations.extend(event.citations)
                elif event.event_type == "search-results":
                    cited_documents = event.documents

            # Display citations and source documents
            if citations:
                print("\n\nCITATIONS:")
                for citation in citations:
                    print(citation)

                print("\nDOCUMENTS:")
                for document in cited_documents:
                    print(document)

            print(f"\n{'-'*100}\n")

In [6]:
bot = PaperBot(vectorstore)
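
One detail in `PaperBot.run`: chunks retrieved for several search queries are simply concatenated, so the same page can be passed to the model more than once. A possible refinement, sketched below under the assumption that documents are dicts with a `text` key (as returned by `Vectorstore.retrieve`), is to drop duplicates while keeping the original order. The helper name `dedupe_documents` is hypothetical.

```python
def dedupe_documents(documents: list[dict]) -> list[dict]:
    """Keep the first occurrence of each chunk, preserving retrieval order."""
    seen = set()
    unique = []
    for doc in documents:
        if doc["text"] not in seen:
            seen.add(doc["text"])
            unique.append(doc)
    return unique


# Invented chunks: two queries both retrieved "page 1".
retrieved = [{"text": "page 1"}, {"text": "page 3"}, {"text": "page 1"}]
unique_docs = dedupe_documents(retrieved)
print(unique_docs)
```

This could be applied to `documents` just before `document_contents` is built.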

In [7]:
# Queries:

# What are the key contributions of the paper: Attention is all you need?
# Could you elaborate more on how parallelism works in the Transformer architecture?
# Could you talk about the benchmarks that this new architecture was able to beat?
# Could you make me a table of these benchmarks?
# Please enhance the table showing the performance of the Transformer as compared to the previous best models. (If generation is off)
# In the above table, does a higher score mean better performance?

In [8]:
bot.run()

USER: What are the key contributions of the paper: Attention is all you need?
Retrieving information...
PAPER_BOT:
The paper "Attention is all you need" proposes a new network architecture called the Transformer. The Transformer, unlike existing models, is based solely on attention mechanisms, doing away with recurrence and convolutions. 

The authors of the paper argue that the Transformer is superior in quality to existing models, whilst being more parallelisable and requiring significantly less training time. To demonstrate the Transformer's efficacy, the authors applied it to two machine translation tasks. On the WMT 2014 English-to-German translation task, the model achieved a 28.4 BLEU score, which was an improvement over the previous best score. Additionally, the Transformer generalises well to other tasks, as shown by its successful application to English constituency parsing.

Furthermore, the Transformer has constant computational complexity for each layer, which enables the 

## Comparison to Command-R + tool-connector (web search)

- Command-R has been specifically trained with conversational tool use capabilities.
- Given this, it is useful to compare the RAG approach with a tool-use approach in which the model uses a web-search connector and has no direct access to the paper's PDF.

In [9]:
class SearchBot:
    def __init__(self, connectors: list[str]):
        """Initializes an instance of the SearchBot class.

        Parameters:
        connectors (list[str]): list of connectors for the bot to use.
        """
        self.conversation_id = str(uuid.uuid4())
        self.connectors = [cohere.ChatConnector(id=connector) for connector in connectors]

    def run(self):
        """ Runs the chatbot application.

        Opens up an interactive chat window for the user.
        Can exit with the input: quit.
        """
        while True:
            # Manage interactions
            message = input("USER: ")

            if message.lower() == "quit":
                print("Thanks for chatting, talk to you soon!")
                break


            # Generate response
            response = co.chat_stream(
                message=message,
                model="command-r",
                conversation_id=self.conversation_id,
                connectors=self.connectors,
            )

            # Print the chatbot response, citations, and documents
            print("\nSEARCH_BOT:")
            citations = []
            cited_documents = []

            # Display response
            for event in response:
                if event.event_type == "text-generation":
                    print(event.text, end="")
                elif event.event_type == "citation-generation":
                    citations.extend(event.citations)
                elif event.event_type == "search-results":
                    cited_documents = event.documents

            # Display citations and source documents
            if citations:
                print("\n\nCITATIONS:")
                for citation in citations:
                    print(citation)

                print("\nDOCUMENTS:")
                for document in cited_documents:
                    print({'id': document['id'],
                           'snippet': document['snippet'][:50] + '...',
                           'title': document['title'],
                           'url': document['url']})

            print(f"\n{'-'*100}\n")

In [10]:
s_bot = SearchBot(["web-search"])
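
Web search can pull in sources well beyond the paper. One way to narrow it might be to restrict the connector to a single site. The sketch below only builds the connector specs as plain dicts; it assumes the `web-search` connector accepts an `options` dict with a `site` key, which should be checked against the current Cohere connector documentation. The helper name `build_connectors` is hypothetical.

```python
from typing import Optional


def build_connectors(connector_ids: list[str], site: Optional[str] = None) -> list[dict]:
    """Build connector specs as plain dicts, optionally restricting web search to one site."""
    connectors = []
    for connector_id in connector_ids:
        spec = {"id": connector_id}
        # ASSUMPTION: only the web-search connector understands the "site" option.
        if site and connector_id == "web-search":
            spec["options"] = {"site": site}
        connectors.append(spec)
    return connectors


connectors = build_connectors(["web-search"], site="arxiv.org")
print(connectors)
```

The same `id` and `options` fields could be passed to `cohere.ChatConnector` inside `SearchBot.__init__` instead of building dicts.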

In [11]:
# Queries:

# What are the key contributions of the paper: Attention is all you need?
# Could you elaborate more on how parallelism works in the Transformer architecture?
# Could you talk about the benchmarks that this new architecture was able to beat?
# Could you make me a table of these benchmarks?
# Please enhance the table showing the performance of the Transformer as compared to the previous best models. (If generation is off)
# In the above table, does a higher score mean better performance?

In [13]:
s_bot.run()

USER: What are the key contributions of the paper: Attention is all you need?

SEARCH_BOT:
The paper "Attention is All You Need" introduces the Transformer architecture along with its key components. The Transformer is a novel sequence transduction model based entirely on attention mechanisms, dispensing with recurrence and convolutions. 

The paper's main contribution lies in demonstrating the effectiveness of the attention mechanism in NLP tasks, which enables the model to capture long-term dependencies and contextual relationships between words. The Transformer's ability to process input sequences concurrently using self-attention improves training speed and efficiency. 

Another key insight is the introduction of multi-head attention, which allows the model to focus on different parts of the input sequence simultaneously, capturing complex relationships between the input and output sequences.

The Transformer architecture has achieved state-of-the-art results in several NLP tasks, 

## Takeaways

- Comparing the PaperBot with the SearchBot, we notice that the SearchBot draws on much more than the paper when generating answers.
- Perusing the citations, we often see the model using the actual paper along with various blog articles, etc.
- This adds variety to the answers but can also introduce noise. For example:
    - When talking about parallelism, it starts explaining what data/model parallelism means instead of sticking to the context of the paper.
    - Since it was connected to search, it pulled in results published after the paper when generating the table.



Overall, the results are fascinating: the answers are well grounded in the source data.
The model appears to be highly optimized for both RAG applications and tool use.
In my opinion, the noise can be managed via prompt tuning.
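
As a rough sketch of what that prompt tuning could look like, the helper below builds chat arguments with a preamble asking the model to stay within the paper. The name `build_chat_kwargs` and the preamble wording are illustrative. It also assumes the installed SDK accepts a `preamble` argument on `co.chat_stream` (some older versions called it `preamble_override`). Whether this actually reduces the noise would need to be tested.

```python
def build_chat_kwargs(message: str, paper_title: str) -> dict:
    """Assemble keyword arguments for a chat call with a paper-focused preamble."""
    preamble = (
        f"You are discussing the paper '{paper_title}'. "
        "Stick to the content and context of this paper, and say so explicitly "
        "when the paper does not cover the question."
    )
    # ASSUMPTION: the chat endpoint accepts a `preamble` keyword argument.
    return {"message": message, "model": "command-r", "preamble": preamble}


kwargs = build_chat_kwargs(
    "How does parallelism work in the Transformer?",
    "Attention Is All You Need",
)
print(kwargs["preamble"])
```

In `SearchBot.run`, these could be passed along as `co.chat_stream(**kwargs, conversation_id=..., connectors=...)`.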

## Potential Next Steps/Extensions

- This can easily extend to multiple languages (both the search and PDF use cases).
- The RAG application could be enhanced to use a paper database of some sort so that the user doesn't need to provide the PDF URL every time.
- The tool-use approach could be extended to incorporate a variety of connectors supported by Cohere. For example:
    - Daily paper discussions on Slack with your teammates.
    - Generate presentations or articles using Medium.
    - Create a paper knowledge base in Notion.
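
For the paper-database idea, a very small starting point could be an in-memory cache keyed by URL, so each paper is embedded and indexed only once per session. This is a sketch under assumptions: `PaperLibrary` is a hypothetical name, and the builder is passed in as a callable (in the notebook it would be the `Vectorstore` class) so the example runs without API calls.

```python
class PaperLibrary:
    """Cache one store per paper URL so repeated questions reuse the same index."""

    def __init__(self, build_store):
        self._build_store = build_store  # e.g. the Vectorstore class from this notebook
        self._stores = {}

    def get(self, paper_url: str):
        # Build the store on first request only; later calls hit the cache.
        if paper_url not in self._stores:
            self._stores[paper_url] = self._build_store(paper_url)
        return self._stores[paper_url]


# Stand-in builder that records how often it is called.
build_calls = []

def fake_build(url):
    build_calls.append(url)
    return {"url": url}

library = PaperLibrary(fake_build)
first = library.get("https://arxiv.org/pdf/1706.03762.pdf")
second = library.get("https://arxiv.org/pdf/1706.03762.pdf")
print(len(build_calls))
```

A persistent version would also need to save the embeddings and the index to disk, which this sketch does not attempt.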