# Retrieval Augmented Generation (RAG)

Experimenting with LangChain for RAG.
Dataset: 277 arXiv papers in PDF format.
Output: evidence to be used in the Revision stage of the Research & Revision framework.

In [62]:
import os
import openai
import numpy as np

from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

from langchain_community.embeddings.openai import OpenAIEmbeddings
from langchain_community.chat_models import ChatOpenAI

from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

from IPython.display import HTML, display
openai.api_key = os.getenv("OPENAI_API_KEY")

In [63]:
%pwd

'/Users/kremerr/Documents/GitHub/RARR/notebooks'

## Loading the dataset

In [66]:
folder_path = '/Users/kremerr/Documents/GitHub/RARR/archive'
pdf_files = []

# Walk through the directory
for root, dirs, files in os.walk(folder_path):
    for file in files:
        if file.endswith('.pdf'):
            # Construct the full file path and add it to the list
            pdf_files.append(os.path.join(root, file))

pdf_files.sort()
print(f"Found {len(pdf_files)} PDF files.")

Found 277 PDF files.
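
The directory walk above matches only the lowercase `.pdf` suffix. As a minimal sketch of an alternative, the hypothetical helper `find_pdfs` below uses `pathlib` and compares the suffix case-insensitively. It is demonstrated on a temporary directory, not on the real archive folder.

```python
from pathlib import Path
import tempfile


def find_pdfs(folder):
    """Return the sorted paths of all PDF files below folder, ignoring suffix case."""
    return sorted(
        str(p)
        for p in Path(folder).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


# Demo on a throwaway directory tree (stand-in for the archive folder)
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "sub").mkdir()
    (Path(tmp) / "a.pdf").touch()
    (Path(tmp) / "sub" / "b.PDF").touch()
    (Path(tmp) / "notes.txt").touch()
    found = find_pdfs(tmp)

print(f"Found {len(found)} PDF files.")
```

In the notebook this would be called as `find_pdfs(folder_path)`.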


In [67]:
# Locate one specific paper in the sorted file list
target = '/Users/kremerr/Documents/GitHub/RARR/archive/2307.14334.pdf'
for i, path in enumerate(pdf_files):
    if path == target:
        print(path)
        print(i)

/Users/kremerr/Documents/GitHub/RARR/archive/2307.14334.pdf
0


In [6]:
# PyPDFLoader.load() returns one Document per PDF page
loaders = [PyPDFLoader(filepath) for filepath in pdf_files]
docs = []
for loader in loaders:
    docs.extend(loader.load())

could not convert string to float: '0.0000000000-170985' : FloatObject (b'0.0000000000-170985') invalid; use 0.0 instead
could not convert string to float: '0.0000000000-170985' : FloatObject (b'0.0000000000-170985') invalid; use 0.0 instead
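
The two messages above are parser warnings, and loading continued past them. A file that cannot be parsed at all would instead raise and stop the loop. The sketch below shows one way to keep going and record failures. `load_all` and `fake_loader` are hypothetical names, and the fake loader stands in for `PyPDFLoader` so the example runs without any PDFs.

```python
def load_all(paths, load_one):
    """Load every path with load_one, collecting failures instead of aborting."""
    loaded, failed = [], []
    for path in paths:
        try:
            loaded.extend(load_one(path))
        except Exception as exc:  # broad on purpose: any parser error is recorded
            failed.append((path, repr(exc)))
    return loaded, failed


def fake_loader(path):
    """Stand-in loader: returns two 'pages' per file, fails on one file."""
    if path.endswith("broken.pdf"):
        raise ValueError("unreadable file")
    return [f"{path} page 1", f"{path} page 2"]


pages, failures = load_all(["a.pdf", "broken.pdf", "b.pdf"], fake_loader)
print(len(pages), "pages loaded;", len(failures), "file(s) failed")
```

With the real loader, the call would presumably look like `load_all(pdf_files, lambda p: PyPDFLoader(p).load())`.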


In [7]:
print(docs[8])

page_content='independent evaluation where raters assessed the quality of individual report findings. Prior to performing the\nfinal evaluation, we iterated upon the instructions for the raters and calibrated their grades using a pilot set\nof 25 cases that were distinct from the evaluation set. Side-by-side evaluation was performed for all 246 cases,\nwhere each case was rated by a single radiologist randomly selected from a pool of four. For independent\nevaluation, each of the four radiologists independently annotated findings generated by three Med-PaLM M\nmodel variants (12B, 84B, and 562B) for every case in the evaluation set. Radiologists were blind to the\nsource of the report findings for all evaluation tasks, and the reports were presented in a randomized order.\nSide-by-side evaluation The input to each side-by-side evaluation was a single chest X-ray, along with the\n“indication” section from the MIMIC-CXR study. Four alternative options for the “findings” section of the\nr

## Splitting the documents

In [8]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,
    chunk_overlap=150
)

In [9]:
splits = text_splitter.split_documents(docs)

In [10]:
for i in range(10):
    print(splits[i].page_content)
    print()

Towards Generalist Biomedical AI
Tao Tu∗,‡, 1, Shekoofeh Azizi∗,‡, 2,
Danny Driess2, Mike Schaekermann1, Mohamed Amin1, Pi-Chuan Chang1, Andrew Carroll1,
Chuck Lau1, Ryutaro Tanno2, Ira Ktena2, Basil Mustafa2, Aakanksha Chowdhery2, Yun Liu1,
Simon Kornblith2, David Fleet2, Philip Mansfield1, Sushant Prakash1, Renee Wong1, Sunny Virmani1,
Christopher Semturs1, S Sara Mahdavi2, Bradley Green1, Ewa Dominowska1, Blaise Aguera y Arcas1,
Joelle Barral2, Dale Webster1, Greg S. Corrado1, Yossi Matias1, Karan Singhal1, Pete Florence2,
Alan Karthikesalingam†,‡,1and Vivek Natarajan†,‡,1
1Google Research,2Google DeepMind
Medicine is inherently multimodal, with rich data modalities spanning text, imaging, genomics, and more.
Generalist biomedical artificial intelligence (AI) systems that flexibly encode, integrate, and interpret
this data at scale can potentially enable impactful applications ranging from scientific discovery to care
delivery. To enable the development of these models, we first cur

In [69]:
print("There are {i} splits in total.".format(i=len(splits)))

There are 22747 splits in total.
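
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of overlapping character windows. It is not the algorithm `RecursiveCharacterTextSplitter` uses, which also tries to break on separators such as paragraph and line boundaries. The function name `sliding_window_chunks` is an assumption made for illustration.

```python
def sliding_window_chunks(text, chunk_size=1500, chunk_overlap=150):
    """Cut text into fixed-size character windows that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(4000))  # 4000 characters of dummy text
chunks = sliding_window_chunks(sample)
print([len(c) for c in chunks])
# The tail of one chunk is repeated at the head of the next
print(chunks[0][-150:] == chunks[1][:150])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.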


## Creating a vectorstore using Chroma

In [15]:
embedding = OpenAIEmbeddings(disallowed_special=())

In [46]:
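# Embed the first two chunks and compare them
sentence1 = splits[0].page_content
sentence2 = splits[1].page_content

embedding1 = embedding.embed_query(sentence1)
embedding2 = embedding.embed_query(sentence2)

# Dot product as a similarity score; it equals cosine similarity only if
# both vectors have unit length
np.dot(embedding1, embedding2)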

0.9205597944928818
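
The dot product above is a cosine similarity only when both vectors have length 1. OpenAI's documentation describes its embeddings as normalized, so that should hold here, but the explicit formula below makes no such assumption. This is a dependency-free sketch with toy vectors. `cosine_similarity` is a helper written for illustration, not a LangChain function.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


u = [3.0, 4.0]
v = [4.0, 3.0]
print(cosine_similarity(u, v))

# After scaling both vectors to unit length, the plain dot product gives the same value
u_unit = [x / 5.0 for x in u]
v_unit = [x / 5.0 for x in v]
print(sum(x * y for x, y in zip(u_unit, v_unit)))
```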

In [17]:
%pwd

'/Users/kremerr/Documents/GitHub/RARR/notebooks'

In [18]:
persist_directory = '/Users/kremerr/Documents/GitHub/RARR/chroma'
vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    persist_directory=persist_directory,
    collection_name="langchain_collection"
)

In [19]:
print(vectordb._collection.count())

22747


In [20]:
question = "What is the attention mechanism in a transformer model?"
# Use a separate name so the loaded pages in `docs` are not overwritten
retrieved_docs = vectordb.similarity_search(question, k=3)

In [21]:
for doc in retrieved_docs:
    print(doc.page_content)
    print()

Figure 1: Multi-head attention & scaled dot product attention (Vaswani et al., 2017)
2.1 T RANSFORMER ARCHITECTURE
The transformer model was first proposed in 2017 for a machine translation task, and since then, numerous models have
been developed based on the inspiration of the original transformer model to address a variety of tasks across different fields.
While some models have utilized the vanilla transformer architecture as is, others have leveraged only the encoder or decoder
module of the transformer model. As a result, the task and performance of transformer-based models can vary depending on
the specific architecture employed. Nonetheless, a key and widely used component of transformer models is self-attention,
which is essential to their functionality. All transformer-based models employ the self-attention mechanism and multi-head
attention, which typically forms the primary learning layer of the architecture. Given the significance of self-attention, the
role of the attenti

In [22]:
# Fetch the 3 nearest chunks, then keep the 2 that balance relevance and diversity
mmr_docs = vectordb.max_marginal_relevance_search(question, k=2, fetch_k=3)

In [23]:
for doc in mmr_docs:
    print(doc.page_content)
    print()

Figure 1: Multi-head attention & scaled dot product attention (Vaswani et al., 2017)
2.1 T RANSFORMER ARCHITECTURE
The transformer model was first proposed in 2017 for a machine translation task, and since then, numerous models have
been developed based on the inspiration of the original transformer model to address a variety of tasks across different fields.
While some models have utilized the vanilla transformer architecture as is, others have leveraged only the encoder or decoder
module of the transformer model. As a result, the task and performance of transformer-based models can vary depending on
the specific architecture employed. Nonetheless, a key and widely used component of transformer models is self-attention,
which is essential to their functionality. All transformer-based models employ the self-attention mechanism and multi-head
attention, which typically forms the primary learning layer of the architecture. Given the significance of self-attention, the
role of the attenti
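
Maximal marginal relevance (MMR) first fetches `fetch_k` nearest chunks, then greedily keeps `k` of them, trading relevance to the query against similarity to chunks already chosen. The sketch below is a common formulation of that idea on toy 2-D vectors. It is not claimed to match Chroma's internal implementation line for line, and `mmr_select` and the sample vectors are made up for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr_select(query_vec, candidate_vecs, k=2, lambda_mult=0.5):
    """Greedily pick k candidate indices by maximal marginal relevance.

    lambda_mult=1.0 ranks by relevance only; lower values reward diversity.
    """
    relevance = [cosine(query_vec, c) for c in candidate_vecs]
    selected = []
    remaining = list(range(len(candidate_vecs)))
    while remaining and len(selected) < k:
        def score(i):
            # Redundancy: highest similarity to anything already selected
            redundancy = max(
                (cosine(candidate_vecs[i], candidate_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
candidates = [
    [1.0, 0.30],   # most relevant
    [1.0, 0.32],   # near-duplicate of the first candidate
    [1.0, -0.50],  # less relevant, but points in a different direction
]
print(mmr_select(query, candidates, k=2, lambda_mult=0.5))
print(mmr_select(query, candidates, k=2, lambda_mult=1.0))
```

With `lambda_mult=0.5` the near-duplicate is skipped in favor of the more distinct candidate. With `lambda_mult=1.0` the selection reduces to plain similarity ranking.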

In [24]:
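# Write the index to persist_directory; newer Chroma releases persist
# automatically and deprecate this call
vectordb.persist()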

## Question Answering

In [25]:
# os.environ["LANGCHAIN_TRACING_V2"] = "true"
# os.environ["LANGCHAIN_ENDPOINT"] = "https://api.langchain.plus"
# os.environ["LANGCHAIN_API_KEY"] = "<your-langsmith-api-key>"  # never commit a real key

In [58]:
# Alternative question: "What is a good replacement for eggs in baking?"
question = "What was the first name of the 22nd president of the United States of America?"

llm = ChatOpenAI(model_name="gpt-3.5-turbo-0125", temperature=0)

prompt_template = """<human>: Answer the question based only on the following context. If you cannot answer the question with the context, please respond with 'I don't know':
### CONTEXT
{context}
### QUESTION
Question: {question}
\n
<bot>:
"""
QA_CHAIN_PROMPT = PromptTemplate.from_template(prompt_template)

In [59]:
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    return_source_documents=True,
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT}
)

# .invoke() replaces calling the chain object directly, which is deprecated
result = qa_chain.invoke({"query": question})

In [60]:
result["result"]

"I don't know."
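
The answer above follows from the prompt's instruction to say 'I don't know' when the context does not contain the answer: the retriever can only return chunks from the arXiv papers, and the question is about a US president. As a rough sketch of what the chain sends to the model, the code below fills the same template by hand. It assumes the default "stuff" behaviour of joining retrieved chunks with a blank line. `build_prompt` and the sample chunks are hypothetical.

```python
# Same wording as the notebook's prompt template
PROMPT_TEMPLATE = """<human>: Answer the question based only on the following context. If you cannot answer the question with the context, please respond with 'I don't know':
### CONTEXT
{context}
### QUESTION
Question: {question}

<bot>:
"""


def build_prompt(question, retrieved_chunks, separator="\n\n"):
    """Join the retrieved chunks into one context string and fill the template."""
    context = separator.join(retrieved_chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)


# Made-up chunks standing in for what the retriever might return
sample_chunks = [
    "The transformer model was first proposed in 2017 for a machine translation task.",
    "Self-attention is a key component of transformer models.",
]
prompt = build_prompt(
    "What was the first name of the 22nd president of the United States of America?",
    sample_chunks,
)
print(prompt)
```

The chunks the chain actually used are available in `result["source_documents"]`, since `return_source_documents=True` was set.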