# Research Paper Engine using arXiv, LangChain 🦜️🔗 and Google Gemini

| | |
|-|-|
|Author(s) | [Tahreem Rasul](https://github.com/tahreemrasul) |

<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/tahreemrasul/rag_research_paper_engine_workshop/blob/main/rag_research_paper_engine.ipynb">
      <img width="32px" src="https://www.gstatic.com/pantheon/images/bigquery/welcome_page/colab-logo.svg" alt="Google Colaboratory logo"><br> Run in Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/tahreemrasul/rag_research_paper_engine_workshop/blob/main/rag_research_paper_engine.ipynb">
      <img width="28px" src="https://upload.wikimedia.org/wikipedia/commons/9/91/Octicons-mark-github.svg" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
</table>

## Overview

This notebook demonstrates how to build a research paper engine on top of the arXiv API, and shows how an LLM's responses can be improved by augmenting its knowledge with external data sources such as documents. The notebook uses Vertex AI Gemini 1.0 Pro for text, the Embeddings for Text API, the arXiv API, and LangChain 🦜️🔗.

## Context

Large Language Models (LLMs) have improved quantitatively and qualitatively. They can learn new abilities without being directly trained on them. However, LLMs have constraints: they are unaware of events that occurred after training, and it is almost impossible to trace their responses back to sources. LLM-based systems should ideally cite their sources and be grounded in facts.

One approach to addressing these constraints is to augment the prompt sent to the LLM with relevant data retrieved from an external knowledge base through an Information Retrieval (IR) mechanism.

This approach is called Retrieval Augmented Generation (RAG), also known as Generative QA in the context of a Question Answering task. There are two main components in a RAG-based architecture: (1) Retriever and (2) Generator.
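To make the two components concrete, here is a minimal sketch in plain Python. It is not the pipeline built in this notebook: the word-overlap scoring and the names `retrieve` and `build_prompt` are hypothetical stand-ins, chosen only to show a retriever selecting passages and a generator receiving them as context.

```python
# Toy RAG pipeline: a word-overlap retriever plus the prompt a generator would receive.
# All names are hypothetical; a real system would use embeddings and an LLM.

def retrieve(question, corpus, k=2):
    """Retriever: rank passages by how many question words they share."""
    q_words = set(question.lower().split())
    ranked = sorted(
        corpus,
        key=lambda passage: len(q_words & set(passage.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, passages):
    """Generator input: the question augmented with the retrieved context."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


corpus = [
    "ReLU is not differentiable at 0.",
    "Sigmoid saturates for large inputs.",
    "Chroma is a vector store.",
]
question = "Is ReLU differentiable at 0?"
top_passages = retrieve(question, corpus, k=1)
rag_prompt = build_prompt(question, top_passages)
print(rag_prompt)
```

In the rest of the notebook, the retriever role is played by a vector store and the generator role by Gemini.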

## Getting Started

### Install Vertex AI SDK, other packages and their dependencies

Install the following packages required to execute this notebook.

In [1]:
# Install LangChain and related packages
!pip install --upgrade --quiet langchain langchain-google-vertexai langchain-community chromadb arxiv pymupdf

### Restart current runtime

To use the newly installed packages in this Jupyter runtime, you must restart the runtime. You can do this by running the cell below, which will restart the current kernel.

In [2]:
# Automatically restart kernel after installs so that your environment can access the new packages
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

<div class="alert alert-block alert-warning">
<b>⚠️ Before proceeding, please wait for the kernel to finish restarting ⚠️</b>
</div>

### Authenticating your notebook environment

If you are using Colab, you will need to authenticate yourself first. The next cell will check if you are currently using Colab, and will start the authentication process.

If you are using Vertex AI Workbench, you will not require additional authentication.

For more information, you can check out the setup instructions [here](https://github.com/GoogleCloudPlatform/generative-ai/tree/main/setup-env).

In [3]:
import sys

if "google.colab" in sys.modules:
    from google.colab import auth

    auth.authenticate_user()

## Retrieve Relevant Papers from arXiv API

This step retrieves research papers relevant to the user query. The document corpus is the set of research papers pulled from the `arXiv` API. We use the `ArxivLoader` class from LangChain to load the PDFs of these papers.

In [4]:
# @title Query & No. of Papers { display-mode: "form" }
query = "neural networks"  # @param {type:"string"}

# @title Total Docs { display-mode: "form" }
num_papers = "3"  # @param {type: "string"}

In [5]:
from langchain_community.document_loaders import ArxivLoader

arxiv_docs = ArxivLoader(query=query, load_max_docs=int(num_papers)).load()

Once retrieved, display the metadata to check which papers were returned.

In [6]:
# Iterate over the documents actually returned, which may be fewer than num_papers
for i, doc in enumerate(arxiv_docs, start=1):
    print(f"Paper # {i}:")
    print(f"Published: {doc.metadata['Published']}")
    print(f"Title: {doc.metadata['Title']}")
    print(f"Authors: {doc.metadata['Authors']}")
    print(f"Summary: {doc.metadata['Summary']}")
    print('------------------------------------------------------------------------------------------------------------')


Paper # 1:
Published: 2023-04-18
Title: Lecture Notes: Neural Network Architectures
Authors: Evelyn Herberg
Summary: These lecture notes provide an overview of Neural Network architectures from
a mathematical point of view. Especially, Machine Learning with Neural Networks
is seen as an optimization problem. Covered are an introduction to Neural
Networks and the following architectures: Feedforward Neural Network,
Convolutional Neural Network, ResNet, and Recurrent Neural Network.
------------------------------------------------------------------------------------------------------------
Paper # 2:
Published: 2023-11-18
Title: Bayesian Neural Networks: A Min-Max Game Framework
Authors: Junping Hong, Ercan Engin Kuruoglu
Summary: Bayesian neural networks use random variables to describe the neural networks
rather than deterministic neural networks and are mostly trained by variational
inference which updates the mean and variance at the same time. Here, we
formulate the Bayesian neural 

## Chunk documents - TextSplitter

Split the retrieved documents into smaller chunks. Choose a chunk size small enough that several chunks fit within the LLM's context length.
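To illustrate what chunk size and chunk overlap mean, here is a simplified sliding-window splitter. It is only a sketch: `split_with_overlap` is a hypothetical helper, and `RecursiveCharacterTextSplitter` additionally tries to split on separators such as paragraph and line breaks rather than at fixed character offsets.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size character windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

The overlap repeats the tail of one chunk at the head of the next, so a sentence that straddles a boundary is less likely to be cut off from its context.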

In [7]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# One splitter is reused for every paper
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)

pdf_data = []
for doc in arxiv_docs:
    doc_splits = text_splitter.create_documents([doc.page_content])
    # Record each chunk's position within its paper
    for idx, split in enumerate(doc_splits):
        split.metadata["chunk"] = idx
    pdf_data.append(doc_splits)

print(f"# of pdfs = {len(pdf_data)} \n# of split documents = {sum(len(doc_splits) for doc_splits in pdf_data)}")

# of pdfs = 3 
# of split documents = 150


## Create the VertexAI Embedding model

In [20]:
# @title Project { display-mode: "form" }
PROJECT_ID = "build-with-ai-424207"  # @param {type:"string"}

# Set the project id
! gcloud config set project {PROJECT_ID}

# @title Region { display-mode: "form" }
REGION = "us-central1"  # @param {type: "string"}


Updated property [core/project].


In [15]:
from langchain_google_vertexai import VertexAIEmbeddings

embedding_model = VertexAIEmbeddings(
    model_name="textembedding-gecko@latest", project=PROJECT_ID
)
print(embedding_model)

client=<vertexai.language_models.TextEmbeddingModel object at 0x7f37607ed5d0> async_client=None project='build-with-ai-424207' location='us-central1' request_parallelism=5 max_retries=6 stop=None model_name='textembedding-gecko@latest' model_family=None full_model_name=None client_preview=None temperature=None max_output_tokens=None top_p=None top_k=None credentials=None n=1 streaming=False safety_settings=None api_transport=None api_endpoint=None tuned_model_name=None instance={'max_batch_size': 250, 'batch_size': 250, 'min_batch_size': 5, 'min_good_batch_size': 5, 'lock': <unlocked _thread.lock object at 0x7f376213b540>, 'batch_size_validated': False, 'task_executor': <concurrent.futures.thread.ThreadPoolExecutor object at 0x7f376146fb50>, 'embeddings_task_type_supported': True, 'get_embeddings_with_retry': <function _TextEmbeddingModel.get_embeddings at 0x7f3769550310>}


## Configure `ChromaDB` as Vector Store

This step generates embeddings from the documents and adds the embeddings to the vector store. The vector store being used is the `Chroma` database.
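The idea behind a vector store can be sketched with a few hand-written vectors. This is an illustration under assumptions: the three-dimensional "embeddings" are made up, `toy_similarity_search` is a hypothetical helper, and cosine similarity is chosen for clarity rather than because it is the metric `Chroma` is configured with here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up embeddings keyed by chunk text (real embeddings have hundreds of dimensions)
toy_store = {
    "chunk about ReLU": [0.9, 0.1, 0.0],
    "chunk about sigmoid": [0.1, 0.9, 0.0],
    "chunk about optimizers": [0.0, 0.2, 0.9],
}


def toy_similarity_search(query_vector, store, k=2):
    """Return the k chunk texts whose vectors are most similar to the query."""
    ranked = sorted(
        store,
        key=lambda text: cosine_similarity(query_vector, store[text]),
        reverse=True,
    )
    return ranked[:k]


query_vector = [1.0, 0.0, 0.0]  # pretend this is the embedded search query
nearest = toy_similarity_search(query_vector, toy_store, k=2)
print(nearest)
# ['chunk about ReLU', 'chunk about sigmoid']
```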


In [17]:
!gcloud services enable aiplatform.googleapis.com

from langchain_community.vectorstores import Chroma

# Only the chunks of the first paper (pdf_data[0]) are indexed here
db = Chroma.from_documents(pdf_data[0], embedding_model)



In [34]:
# @title search query { display-mode: "form" }
search_query = "What should be considered when taking derivatives of ReLU?"  # @param {type:"string"}

Verify the `Chroma` vector store with a similarity search.

In [35]:
db.similarity_search(
    search_query
)

[Document(page_content='identity function often helps speed up convergence, since it resembles a linear model, as long as\nthe values are close to zero. Another challenge that needs to be overcome is vanishing derivatives,\nwhich is visibly present for Heaviside, sigmoid and hyperbolic tangent. In contrast, ReLU is not\nbounded on positive values, while also being comparatively cheap to compute, because linear\ncomputations tend to be very well optimized in modern computing. Altogether, these advan-\ntages have resulted in ReLU (and variants thereof) becoming the most widely used activation\nfunction currently. As a remedy for the vanishing gradient on negative values, leaky ReLU was\nintroduced. When taking derivatives of ReLU one needs to account for the non-diﬀerentiability\nat 0, but in numerical practice this is easily overcome.\nWith the help of Neural Networks we want to solve a task, cf.\n[15, Section 5.1].\nLet the\nperformance of the algorithm for the given task be measured b

## Retrieval based Question/Answering Chain

We will demonstrate using three LangChain retrieval Q&A chains:

- `RetrievalQA`
- `ConversationalRetrievalChain`
- Advanced: customized Q&A prompt and format

We begin by initializing a Vertex AI LLM and a LangChain retriever that fetches documents from our Chroma database, which contains the paper chunks ingested earlier.

For Q&A chains our retriever is passed directly to the chain and can be used automatically without any further configuration.

Behind the scenes, first the search query is passed to the retriever which runs a search and returns relevant document chunks.

These chunks are then passed to the prompt used by the LLM to be used as context.

In [36]:
from langchain_google_vertexai import VertexAI
from langchain.chains import RetrievalQA

llm = VertexAI(model_name="gemini-pro")

retriever = db.as_retriever()

### `RetrievalQA` chain

This is the simplest document Q&A chain offered by LangChain.

There are several different chain types available.

- In these examples we use the `stuff` type, which simply inserts all of the document chunks into the prompt.
- This has the advantage of only making a single LLM call, which is faster and more cost efficient.
- However, if we have a large number of search results we run the risk of exceeding the token limit in our prompt, or truncating useful information.
- Other chain types such as `map_reduce` and `refine` use an iterative process which makes multiple LLM calls, taking individual document chunks at a time and refining the answer iteratively.
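The trade-off between these chain types can be sketched by counting calls to a stand-in LLM. Everything below is hypothetical (`fake_llm`, `stuff_answer`, `map_reduce_answer`, and the prompt wording); LangChain's real prompts and combination logic differ, and only the call pattern is the point.

```python
def fake_llm(prompt, call_log):
    """Stand-in for an LLM call: records the prompt and returns a placeholder."""
    call_log.append(prompt)
    return f"<answer {len(call_log)}>"


def stuff_answer(question, chunks, call_log):
    """'stuff' style: put every chunk into one prompt and make a single call."""
    context = "\n\n".join(chunks)
    return fake_llm(f"{context}\n\nQuestion: {question}", call_log)


def map_reduce_answer(question, chunks, call_log):
    """'map_reduce' style: one call per chunk, then one call to combine the partial answers."""
    partials = [fake_llm(f"{c}\n\nQuestion: {question}", call_log) for c in chunks]
    return fake_llm("Combine these answers:\n" + "\n".join(partials), call_log)


chunks = ["chunk A", "chunk B", "chunk C"]
stuff_calls, map_reduce_calls = [], []
stuff_answer("What is ReLU?", chunks, stuff_calls)
map_reduce_answer("What is ReLU?", chunks, map_reduce_calls)
print(len(stuff_calls), len(map_reduce_calls))
# 1 4
```

With three chunks, the `stuff` style makes one call whose prompt grows with every chunk, while the map-reduce style makes four smaller calls.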

In [37]:
retrieval_qa = RetrievalQA.from_chain_type(llm=llm,
                                           chain_type="stuff",
                                           retriever=retriever)

retrieval_qa.invoke(search_query)

{'query': 'What should be considered when taking derivatives of ReLU?',
 'result': 'Here is what should be considered when taking derivatives of ReLU: \n\n* ReLU is not differentiable at 0. This means that the derivative of ReLU is undefined at 0. In practice, this is usually handled by using a small epsilon value to approximate the derivative at 0. For example, the derivative of ReLU at 0 could be approximated as $\\epsilon$ or $-\\epsilon$, where $\\epsilon$ is a small positive number.\n* ReLU may suffer from the "dying ReLU" problem. This problem occurs when a ReLU neuron gets stuck in a state where it is always outputting 0. This can happen if the input to the neuron is always negative, or if the weights of the neuron are very small. When a ReLU neuron is stuck in the "dying" state, it can no longer learn anything new.\n* ReLU is not as robust to noise as other activation functions, such as the sigmoid function or the tanh function. This is because ReLU is a piecewise linear functi

#### Inspecting the process

If we add `return_source_documents=True` we can inspect the document chunks that were returned by the retriever.

This is helpful for debugging, as these chunks may not always be relevant to the answer, or their relevance might not be obvious.

In [38]:
retrieval_qa = RetrievalQA.from_chain_type(llm=llm,
                                 chain_type="stuff",
                                 retriever=retriever,
                                 return_source_documents=True)

results = retrieval_qa.invoke(search_query)

print("*" * 79)
print(results["result"])
print("*" * 79)
for doc in results["source_documents"]:
    print("-" * 79)
    print(doc.page_content)

*******************************************************************************
The text provided doesn't specifically discuss taking derivatives of ReLU, but it does mention that ReLU is not differentiable at 0. When taking derivatives of ReLU, you need to account for this non-differentiability at 0. However, the text mentions that in numerical practice this is easily overcome. So, while being aware of the non-differentiability at 0 is important, it doesn't seem to be a significant challenge when working with ReLU in practice. 

*******************************************************************************
-------------------------------------------------------------------------------
identity function often helps speed up convergence, since it resembles a linear model, as long as
the values are close to zero. Another challenge that needs to be overcome is vanishing derivatives,
which is visibly present for Heaviside, sigmoid and hyperbolic tangent. In contrast, ReLU is not
bounded o

### `ConversationalRetrievalChain`
`ConversationalRetrievalChain` remembers and uses previous questions so you can have a chat-like discovery process.

To use this chain we must provide a memory class to store and pass the previous messages to the LLM as context. Here we use the `ConversationBufferMemory` class that comes with LangChain.
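As a rough mental model, a buffer memory keeps every turn and replays it alongside the next question, so that a follow-up such as "What about the others?" can be rewritten into a standalone question before retrieval. The sketch below is a hypothetical stand-in (`ToyBufferMemory`, `build_condense_prompt`, and the prompt wording are invented for illustration), not LangChain's implementation.

```python
class ToyBufferMemory:
    """Hypothetical stand-in: stores every (question, answer) turn in order."""

    def __init__(self):
        self.turns = []

    def save(self, question, answer):
        self.turns.append((question, answer))

    def as_text(self):
        return "\n".join(f"Human: {q}\nAssistant: {a}" for q, a in self.turns)


def build_condense_prompt(memory, follow_up):
    """Combine the chat history and a follow-up into a prompt asking for a standalone question."""
    return (
        "Rephrase the follow-up question as a standalone question.\n\n"
        f"Chat history:\n{memory.as_text()}\n\n"
        f"Follow-up question: {follow_up}\n"
        "Standalone question:"
    )


toy_memory = ToyBufferMemory()
toy_memory.save(
    "What should be considered when taking derivatives of ReLU?",
    "Its non-differentiability at 0.",
)
condense_prompt = build_condense_prompt(toy_memory, "What about other activation functions?")
print(condense_prompt)
```

Because the buffer grows with every turn, long conversations eventually compete with the retrieved chunks for space in the prompt.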

In [41]:
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
conversational_retrieval = ConversationalRetrievalChain.from_llm(llm=llm,
                                                                 retriever=retriever,
                                                                 memory=memory)


conversational_retrieval.invoke(search_query)["answer"]

'The answer to your question can be found in the text you provided. When taking derivatives of ReLU one needs to account for the non-diﬀerentiability at 0. '

In [43]:
new_query = "What about other activation functions?"
result = conversational_retrieval.invoke(new_query)
print(result["answer"])

## Considerations when taking derivatives of other activation functions:

While your provided context mentions the challenges of taking derivatives for Heaviside, sigmoid, and hyperbolic tangent functions, let's focus on **ReLU** and **Leaky ReLU** as the question prompts.

### ReLU (Rectified Linear Unit)

* **Non-differentiability at 0:** The biggest challenge with ReLU is its non-differentiability at 0. This means that the derivative is undefined at that point, causing issues for gradient-based optimization algorithms.
* **Practical solutions:** In practice, this is often handled by considering the derivative as 0 at 0. Alternatively, a small positive value can be used instead, resulting in a "smoothed" version of the derivative at 0.
* **Computational efficiency:** ReLU is computationally efficient as it only involves a simple comparison and thresholding operation, making it attractive for large-scale deep learning tasks.

### Leaky ReLU

* **Addressing vanishing gradients:** Leaky

In [45]:
new_query = "give me specifically for sigmoid"
result = conversational_retrieval.invoke(new_query)
print(result["answer"])

The derivative of the sigmoid function is $\sigma'(x) = \sigma(x)(1-\sigma(x))$, where $\sigma(x)$ is the sigmoid function, defined as $\sigma(x) = \frac{1}{1 + e^{-x}}$. 

This formula can be derived using the chain rule and the definition of the sigmoid function. For example,
```
\begin{aligned}
\frac{d}{dx} \sigma(x) &= \frac{d}{dx} \left( \frac{1}{1 + e^{-x}} \right) \\
&= \frac{(1 + e^{-x})(-e^{-x})}{(1 + e^{-x})^2} \\
&= \frac{-e^{-x}}{1 + e^{-x}} \cdot \frac{1}{1 + e^{-x}} \\
&= \sigma(x) (1 - \sigma(x)).
\end{aligned}
```

The derivative of the sigmoid function is important in the context of neural networks because it is used to calculate the gradient during backpropagation. The gradient is used to update the weights of the neural network during training, and the derivative of the sigmoid function helps to determine how the weights should be adjusted.




### Advanced: Modifying the default LangChain prompt

In all of the previous examples we used the default prompt that comes with LangChain.

We can inspect our chain object to discover the wording of the prompt template being used.

We may find that this is not suitable for our purposes, and we may wish to customise the prompt, for example to present our results in a different format, or to specify additional constraints.

In [48]:
qa = RetrievalQA.from_chain_type(
    llm=llm, chain_type="stuff", retriever=retriever, return_source_documents=True
)

print(qa.combine_documents_chain.llm_chain.prompt.template)

Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
Helpful Answer:


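Let's modify the prompt so that the LLM acts as a research assistant, answers strictly from the retrieved context, and reasons step by step. We also constrain the LLM to say that it cannot determine the answer, or does not know it, when the context is insufficient or empty.

We create a new `PromptTemplate` and pass it to the chain through the `prompt` entry of `chain_type_kwargs`.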

In [49]:
from langchain.prompts import PromptTemplate

template = """SYSTEM: You are an intelligent research assistant helping the users with their research paper questions.

Question: {question}

Strictly Use ONLY the following pieces of context to answer the question at the end. Think step-by-step and then answer.

Do not try to make up an answer:
 - If the answer to the question cannot be determined from the context alone, say "I cannot determine the answer to that."
 - If the context is empty, just say "I do not know the answer to that."

=============
{context}
=============

Question: {question}
Helpful Answer:"""

prompt = PromptTemplate(template=template, input_variables=["context", "question"])
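To see what the `{context}` and `{question}` placeholders turn into, the sketch below fills a shortened template with plain `str.format`. This is an illustration under assumptions: `toy_template` and the two example chunks are made up, and joining chunks with blank lines is a guess at how a `stuff` chain combines them.

```python
# Shortened, hypothetical version of the template, filled with plain str.format
toy_template = """Use ONLY the following context to answer the question.

=============
{context}
=============

Question: {question}
Helpful Answer:"""

retrieved_chunks = [
    "ReLU is not differentiable at 0.",
    "Leaky ReLU has a small negative slope.",
]

filled_prompt = toy_template.format(
    context="\n\n".join(retrieved_chunks),  # assumed way of combining chunks
    question="Where is ReLU not differentiable?",
)
print(filled_prompt)
```

The text that reaches the LLM is the template with both placeholders substituted, which is why the wording around `{context}` matters as much as the retrieved chunks themselves.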

We can also customize the retriever, for example by setting how many chunks it returns.

In [50]:
# @title No. of Results { display-mode: "form" }
NUMBER_OF_RESULTS = "3"  # @param {type:"string"}

In [51]:
# Create a retriever that returns the k most similar chunks
retriever = db.as_retriever(
    search_type="similarity",
    search_kwargs={
        "k": int(NUMBER_OF_RESULTS)
    },
)

In [52]:
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
    verbose=True,
    chain_type_kwargs={
        "prompt": prompt,
    },
)

In [53]:
print(qa.combine_documents_chain.llm_chain.prompt.template)

SYSTEM: You are an intelligent research assistant helping the users with their research paper questions.

Question: {question}

Strictly Use ONLY the following pieces of context to answer the question at the end. Think step-by-step and then answer.

Do not try to make up an answer:
 - If the answer to the question cannot be determined from the context alone, say "I cannot determine the answer to that."
 - If the context is empty, just say "I do not know the answer to that."

{context}

Question: {question}
Helpful Answer:


In [55]:
# Enable for troubleshooting
qa.combine_documents_chain.verbose = True
qa.combine_documents_chain.llm_chain.verbose = True
qa.combine_documents_chain.llm_chain.llm.verbose = True

import textwrap


def formatter(result):
    print(f"Query: {result['query']}")
    print("." * 80)
    print(f"Response: {wrap(result['result'])}")
    print("." * 80)
    if "source_documents" in result.keys():
        for idx, ref in enumerate(result["source_documents"]):
            print("-" * 80)
            print(f"REFERENCE #{idx}")
            print("-" * 80)
            if "score" in ref.metadata:
                print(f"Matching Score: {ref.metadata['score']}")
            if "source" in ref.metadata:
                print(f"Document Source: {ref.metadata['source']}")
            if "document_name" in ref.metadata:
                print(f"Document Name: {ref.metadata['document_name']}")
            print("." * 80)
            print(f"Content: \n{wrap(ref.page_content)}")
    print("." * 80)


def wrap(s):
    return "\n".join(textwrap.wrap(s, width=120, break_long_words=False))


def ask(query, qa=qa):
    # Run the chain on the query and pretty-print the answer with its sources
    result = qa.invoke({"query": query})
    return formatter(result)

In [None]:
ask(query=search_query)



> Entering new RetrievalQA chain...


> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
SYSTEM: You are an intelligent research assistant helping the users with their research paper questions.

Question: What should be considered when taking derivatives of ReLU?

Strictly Use ONLY the following pieces of context to answer the question at the end. Think step-by-step and then answer.

Do not try to make up an answer:
 - If the answer to the question cannot be determined from the context alone, say "I cannot determine the answer to that."
 - If the context is empty, just say "I do not know the answer to that."

identity function often helps speed up convergence, since it resembles a linear model, as long as
the values are close to zero. Another challenge that needs to be overcome is vanishing derivatives,
which is visibly present for Heaviside, sigmoid and hyperbolic tangent. In contrast, ReLU is 