# Retrieval Augmented Generation (RAG)

Experimenting with LangChain for RAG.
Dataset: 277 arXiv papers in PDF format.
Output: retrieved evidence to be used in the Revision step of the Research & Revision framework.

In [15]:
import os
import openai
import numpy as np

from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

from langchain_community.embeddings.openai import OpenAIEmbeddings
from langchain_community.chat_models import ChatOpenAI

from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

from IPython.display import HTML, display
openai.api_key = os.getenv("OPENAI_API_KEY")

In [2]:
%pwd

'/Users/kremerr/Documents/GitHub/RARR/notebooks'

## Loading the dataset

In [3]:
folder_path = '/Users/kremerr/Documents/GitHub/RARR/archive'
pdf_files = []

# Walk through the directory
for root, dirs, files in os.walk(folder_path):
    for file in files:
        if file.endswith('.pdf'):
            # Construct the full file path and add it to the list
            pdf_files.append(os.path.join(root, file))

pdf_files.sort()
print(f"Found {len(pdf_files)} PDF files.")

Found 277 PDF files.
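The directory walk above can also be written with `pathlib`. This is a minimal sketch, not the notebook's code: `find_pdfs` is a hypothetical helper name, and unlike the cell above it matches the `.pdf` suffix case-insensitively.

```python
from pathlib import Path


def find_pdfs(folder):
    """Return the sorted paths of all PDF files below `folder`.

    Hypothetical helper: matches the suffix case-insensitively,
    which the os.walk version above does not.
    """
    root = Path(folder)
    return sorted(
        str(path)
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() == ".pdf"
    )
```

Usage would be `pdf_files = find_pdfs(folder_path)`, which yields the same kind of sorted list of path strings.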


In [4]:
# Locate the index of one specific paper in the sorted list
target = '/Users/kremerr/Documents/GitHub/RARR/archive/2307.14334.pdf'
for i, path in enumerate(pdf_files):
    if path == target:
        print(path)
        print(i)

/Users/kremerr/Documents/GitHub/RARR/archive/2307.14334.pdf
0


In [5]:
# One loader per PDF; load() returns one Document per page
loaders = [PyPDFLoader(filepath) for filepath in pdf_files]
docs = []
for loader in loaders:
    docs.extend(loader.load())

could not convert string to float: '0.0000000000-170985' : FloatObject (b'0.0000000000-170985') invalid; use 0.0 instead


In [6]:
print(docs[8])

page_content='independent evaluation where raters assessed the quality of individual report findings. Prior to performing the\nfinal evaluation, we iterated upon the instructions for the raters and calibrated their grades using a pilot set\nof 25 cases that were distinct from the evaluation set. Side-by-side evaluation was performed for all 246 cases,\nwhere each case was rated by a single radiologist randomly selected from a pool of four. For independent\nevaluation, each of the four radiologists independently annotated findings generated by three Med-PaLM M\nmodel variants (12B, 84B, and 562B) for every case in the evaluation set. Radiologists were blind to the\nsource of the report findings for all evaluation tasks, and the reports were presented in a randomized order.\nSide-by-side evaluation The input to each side-by-side evaluation was a single chest X-ray, along with the\n“indication” section from the MIMIC-CXR study. Four alternative options for the “findings” section of the\nr

## Splitting the documents

In [7]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,
    chunk_overlap=150
)

In [8]:
splits = text_splitter.split_documents(docs)
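To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of fixed-size chunking with overlap. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on separators such as paragraph and line boundaries); `sliding_window_chunks` is a hypothetical name used only for illustration.

```python
def sliding_window_chunks(text, chunk_size=1500, chunk_overlap=150):
    """Split `text` into chunks of at most `chunk_size` characters.

    Consecutive chunks share `chunk_overlap` characters, so content that
    straddles a boundary appears in both neighbouring chunks.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks
```

With the settings used here, each new chunk starts 1350 characters after the previous one, and its first 150 characters repeat the end of the previous chunk.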

In [9]:
# Inspect the first ten chunks
for split in splits[:10]:
    print(split.page_content)
    print()

Towards Generalist Biomedical AI
Tao Tu∗,‡, 1, Shekoofeh Azizi∗,‡, 2,
Danny Driess2, Mike Schaekermann1, Mohamed Amin1, Pi-Chuan Chang1, Andrew Carroll1,
Chuck Lau1, Ryutaro Tanno2, Ira Ktena2, Basil Mustafa2, Aakanksha Chowdhery2, Yun Liu1,
Simon Kornblith2, David Fleet2, Philip Mansfield1, Sushant Prakash1, Renee Wong1, Sunny Virmani1,
Christopher Semturs1, S Sara Mahdavi2, Bradley Green1, Ewa Dominowska1, Blaise Aguera y Arcas1,
Joelle Barral2, Dale Webster1, Greg S. Corrado1, Yossi Matias1, Karan Singhal1, Pete Florence2,
Alan Karthikesalingam†,‡,1and Vivek Natarajan†,‡,1
1Google Research,2Google DeepMind
Medicine is inherently multimodal, with rich data modalities spanning text, imaging, genomics, and more.
Generalist biomedical artificial intelligence (AI) systems that flexibly encode, integrate, and interpret
this data at scale can potentially enable impactful applications ranging from scientific discovery to care
delivery. To enable the development of these models, we first cur

## Creating a vectorstore using Chroma

In [16]:
embedding = OpenAIEmbeddings(disallowed_special=())

In [40]:
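# Embed the first two chunks and compare them
sentence1 = splits[0].page_content
sentence2 = splits[1].page_content

embedding1 = embedding.embed_query(sentence1)
embedding2 = embedding.embed_query(sentence2)

# The dot product equals cosine similarity only if both vectors have unit length
np.dot(embedding1, embedding2)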

0.9205249452627864
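A raw dot product is a cosine similarity only when both vectors are unit length. The sketch below computes cosine similarity explicitly so the comparison does not rely on that assumption; `cosine_similarity` is a hypothetical helper, written with the standard library only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors `a` and `b`."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Calling `cosine_similarity(embedding1, embedding2)` and comparing it with the dot product above is a quick way to check whether the embeddings are normalized.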

In [46]:
persist_directory = '/Users/kremerr/Documents/GitHub/RARR/chroma'
vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    persist_directory=persist_directory
)

OperationalError: attempt to write a readonly database

In [44]:
print(vectordb._collection.count())

NameError: name 'vectordb' is not defined
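The message `attempt to write a readonly database` comes from SQLite. One possible cause, assumed here rather than diagnosed, is that the persist directory (or a database file already inside it) is not writable by the notebook process. Note that `vectordb` is only defined once `Chroma.from_documents` returns successfully. A minimal pre-flight check, with `is_writable_dir` as a hypothetical helper:

```python
import os


def is_writable_dir(path):
    """Return True if `path` is an existing directory the current user can write to."""
    # W_OK: may create or modify entries; X_OK: may traverse the directory
    return os.path.isdir(path) and os.access(path, os.W_OK | os.X_OK)
```

Running `is_writable_dir(persist_directory)` before building the store rules this cause in or out. If it returns `True`, the problem lies elsewhere, for example in a stale client still holding the database open.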

In [None]:
question = "What is the attention mechanism in a transformer model?"
docs = vectordb.similarity_search(question, k=3)

In [None]:
for doc in docs:
    print(doc.page_content)
    print()

Towards Understanding Chain-of-Thought Prompting:
An Empirical Study of What Matters
Boshi Wang1Sewon Min2Xiang Deng1Jiaming Shen3You Wu3
Luke Zettlemoyer2Huan Sun1
1The Ohio State University2University of Washington3Google Research
{wang.13930,deng.595,sun.397}@osu.edu
{sewon,lsz}@cs.washington.edu ,{jmshen,wuyou}@google.com
Abstract
Chain-of-Thought (CoT) prompting can dra-
matically improve the multi-step reasoning abil-
ities of large language models (LLMs). CoT
explicitly encourages the LLM to generate in-
termediate rationales for solving a problem, by
providing a series of reasoning steps in the
demonstrations. Despite its success, there is
still little understanding of what makes CoT
prompting effective and which aspects of the
demonstrated reasoning steps contribute to its
performance. In this paper, we show that
CoT reasoning is possible even with invalid
demonstrations—prompting with invalid rea-
soning steps can achieve over 80-90% of the
performance obtained using CoT unde

In [None]:
# Fetch the 3 most similar chunks, then keep the 2 that balance relevance and diversity
docs = vectordb.max_marginal_relevance_search(question, k=2, fetch_k=3)

In [None]:
for doc in docs:
    print(doc.page_content)
    print()

Towards Understanding Chain-of-Thought Prompting:
An Empirical Study of What Matters
Boshi Wang1Sewon Min2Xiang Deng1Jiaming Shen3You Wu3
Luke Zettlemoyer2Huan Sun1
1The Ohio State University2University of Washington3Google Research
{wang.13930,deng.595,sun.397}@osu.edu
{sewon,lsz}@cs.washington.edu ,{jmshen,wuyou}@google.com
Abstract
Chain-of-Thought (CoT) prompting can dra-
matically improve the multi-step reasoning abil-
ities of large language models (LLMs). CoT
explicitly encourages the LLM to generate in-
termediate rationales for solving a problem, by
providing a series of reasoning steps in the
demonstrations. Despite its success, there is
still little understanding of what makes CoT
prompting effective and which aspects of the
demonstrated reasoning steps contribute to its
performance. In this paper, we show that
CoT reasoning is possible even with invalid
demonstrations—prompting with invalid rea-
soning steps can achieve over 80-90% of the
performance obtained using CoT unde
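Maximal marginal relevance (MMR) re-ranks the `fetch_k` nearest chunks so that each pick is relevant to the query but not redundant with chunks already chosen. The following is a sketch of the idea in plain Python, not LangChain's implementation; `mmr_select`, `cosine`, and the parameter name `lambda_mult` are used here for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr_select(query_vec, candidate_vecs, k=2, lambda_mult=0.5):
    """Return the indices of up to `k` candidates chosen by MMR.

    lambda_mult=1.0 ranks by relevance only; lower values penalise
    candidates that resemble ones already selected.
    """
    relevance = [cosine(query_vec, c) for c in candidate_vecs]
    selected = []
    remaining = list(range(len(candidate_vecs)))

    while remaining and len(selected) < k:
        def score(i):
            # Redundancy: highest similarity to any already selected candidate
            redundancy = max(
                (cosine(candidate_vecs[i], candidate_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)

    return selected
```

The first pick is always the most relevant candidate, so with a small `fetch_k` the top result matches plain similarity search, as in the output above.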

In [None]:
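# Write the index to persist_directory (may be unnecessary with Chroma 0.4+, which persists automatically)
vectordb.persist()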

## Question Answering

In [None]:
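import os
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.langchain.plus"
# Never hard-code the API key in the notebook; export LANGCHAIN_API_KEY in the shell instead
if not os.getenv("LANGCHAIN_API_KEY"):
    raise RuntimeError("LANGCHAIN_API_KEY is not set")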

In [None]:
llm = ChatOpenAI(model_name="gpt-3.5-turbo-0125", temperature=0)

prompt_template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer. Use three sentences maximum. Keep the answer as concise as possible.
{context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT = PromptTemplate.from_template(prompt_template)
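To see what the model actually receives, the template can be filled in by hand. This sketch uses plain `str.format` rather than `PromptTemplate`, and it assumes the retrieved chunks are joined with a blank line between them; `build_prompt` is a hypothetical helper.

```python
PROMPT_TEMPLATE = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer. Use three sentences maximum. Keep the answer as concise as possible.
{context}
Question: {question}
Helpful Answer:"""


def build_prompt(chunks, question):
    """Fill the template with retrieved chunk texts and the user question."""
    # Assumption: chunks are separated by a blank line in the context block
    context = "\n\n".join(chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)
```

Printing `build_prompt([d.page_content for d in docs], question)` shows the full prompt text for a given set of retrieved chunks.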

In [None]:
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    return_source_documents=True,
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT}
)

# invoke() is the current entry point; calling the chain object directly is deprecated
result = qa_chain.invoke({"query": question})

In [None]:
result["result"]
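Because the chain is built with `return_source_documents=True`, the result also carries the retrieved chunks under `"source_documents"`. These are the natural candidates for the evidence mentioned at the top of the notebook. The sketch below shows one way to flatten them into plain records; `collect_evidence` and the record layout are assumptions, and the `source` and `page` metadata keys are the ones `PyPDFLoader` normally sets.

```python
def collect_evidence(result, max_chars=300):
    """Turn the chain result into a list of evidence records.

    Each record holds the source file, the page number, and the first
    `max_chars` characters of the retrieved chunk.
    """
    evidence = []
    for doc in result.get("source_documents", []):
        evidence.append({
            "source": doc.metadata.get("source"),
            "page": doc.metadata.get("page"),
            "text": doc.page_content[:max_chars],
        })
    return evidence
```

Calling `collect_evidence(result)` after the chain has run returns one record per retrieved chunk, ready to pass on to the Revision step.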