# Retrieval-Augmented Generation (RAG)

This notebook experiments with LangChain for retrieval-augmented generation (RAG).

**Dataset**: TriviaQA is a reading comprehension dataset containing over 650K question-answer-evidence triples. It includes 95K question-answer pairs authored by trivia enthusiasts, along with independently gathered evidence documents (six per question on average) that provide high-quality distant supervision for answering the questions.

**Output**: Evidence that will be used in the Revision part of the Research & Revision framework.

In [1]:
import os
import openai
import numpy as np
import time

from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

from langchain_community.embeddings.openai import OpenAIEmbeddings
from langchain_community.chat_models import ChatOpenAI

from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

from IPython.display import HTML, display
openai.api_key = os.getenv("OPENAI_API_KEY")

In [2]:
persist_directory = '/Users/kremerr/Documents/GitHub/RARR/triviaqa_vecdb'
embedding = OpenAIEmbeddings()
vectordb = Chroma(persist_directory=persist_directory, embedding_function=embedding)


## Loading the dataset

In [None]:
folder_path = '/Users/kremerr/Documents/GitHub/RARR/trivia_qa/evidence'
trivia_qa_files = []

# Walk through the directory
for root, dirs, files in os.walk(folder_path):
    for file in files:
        if file.endswith('.txt'):
            # Construct the full file path and add it to the list
            trivia_qa_files.append(os.path.join(root, file))

trivia_qa_files.sort()
print(f"Found {len(trivia_qa_files)} TXT files.")

In [9]:
# Find the index of one specific evidence file in the sorted list
target_path = '/Users/kremerr/Documents/GitHub/RARR/evidence/web/0/0_836.txt'
for i, filepath in enumerate(trivia_qa_files):
    if filepath == target_path:
        print(filepath)
        print(i)

/Users/kremerr/Documents/GitHub/RARR/evidence/web/0/0_836.txt
1839


In [10]:
loaders = [
    TextLoader(filepath) for filepath in trivia_qa_files]
docs = []
for loader in loaders:
    docs.extend(loader.load())

In [11]:
print(docs[0])

page_content="1. What word, extended from a more popular term, refers to a fictional book of between 20,000 and 50,000 words? - Jade Wright - Liverpool Echo\nNews Opinion\n1. What word, extended from a more popular term, refers to a fictional book of between 20,000 and 50,000 words?\n2. Who wrote the 1855 poem The Charge of the Light Brigade?\n\xa0Share\nGet daily updates directly to your inbox\n+ Subscribe\nCould not subscribe, try again laterInvalid Email\n2. Who wrote the 1855 poem The Charge of the Light Brigade?\n3. In 1960 the UK publishing ban was lifted on what 1928 book?\n4. How many times would a quarto sheet be folded?\n5. Who wrote the seminal 1936 self-help book How to Win Friends and Influence People?\n6. Who in 1450 invented movable type, thus revolutionising printing?\n7. Which Polish-born naturalised British novelist's real surname was Korzeniowski?\n8. Which short-lived dramatist is regarded as the first great exponent of blank verse?\n9. Who wrote the maxim “Cogito, 

## Splitting the documents

In [12]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100
)

In [13]:
splits = text_splitter.split_documents(docs)
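
To make the two splitter parameters concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplification and not how `RecursiveCharacterTextSplitter` works internally (the real splitter prefers to break on separators such as paragraph and line boundaries). The function name `split_with_overlap` and the sample string are made up for illustration.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into windows of at most chunk_size characters.

    Each window starts (chunk_size - chunk_overlap) characters after the
    previous one, so neighbouring chunks share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


# Hypothetical sample text, small enough to inspect by eye
sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

With `chunk_size=1000` and `chunk_overlap=100` as in the cell above, the overlap means a sentence that straddles a chunk boundary is likely to appear whole in at least one of the two neighbouring chunks.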

In [14]:
for i in range(10):
    print(splits[i].page_content)
    print()

1. What word, extended from a more popular term, refers to a fictional book of between 20,000 and 50,000 words? - Jade Wright - Liverpool Echo
News Opinion
1. What word, extended from a more popular term, refers to a fictional book of between 20,000 and 50,000 words?
2. Who wrote the 1855 poem The Charge of the Light Brigade?
 Share
Get daily updates directly to your inbox
+ Subscribe
Could not subscribe, try again laterInvalid Email
2. Who wrote the 1855 poem The Charge of the Light Brigade?
3. In 1960 the UK publishing ban was lifted on what 1928 book?
4. How many times would a quarto sheet be folded?
5. Who wrote the seminal 1936 self-help book How to Win Friends and Influence People?
6. Who in 1450 invented movable type, thus revolutionising printing?
7. Which Polish-born naturalised British novelist's real surname was Korzeniowski?
8. Which short-lived dramatist is regarded as the first great exponent of blank verse?

8. Which short-lived dramatist is regarded as the first great e

In [15]:
len(splits)

9139547

## Creating a vectorstore using Chroma

In [16]:
embedding = OpenAIEmbeddings(disallowed_special=())


In [17]:
sentence1 = splits[0].page_content
sentence2 = splits[1].page_content

embedding1 = embedding.embed_query(sentence1)
embedding2 = embedding.embed_query(sentence2)

np.dot(embedding1, embedding2)

0.8450307602799495
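
The dot product above can be read as a cosine similarity only if both vectors have unit length. OpenAI's documentation describes its embeddings as normalized to length 1, but that is an assumption worth checking for any other embedding model. The sketch below uses small made-up vectors (not real embeddings) and plain Python to show the difference between the two quantities.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Made-up vectors of length 5, standing in for unnormalized embeddings
u = [3.0, 4.0]
v = [4.0, 3.0]

# The same directions, rescaled to unit length
u_unit = [x / 5.0 for x in u]
v_unit = [x / 5.0 for x in v]

print("dot, unnormalized:   ", dot(u, v))
print("cosine similarity:   ", cosine_similarity(u, v))
print("dot, unit vectors:   ", dot(u_unit, v_unit))
```

For the unnormalized pair the dot product depends on the vector lengths, while for the unit-length pair it matches the cosine similarity up to floating-point error.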

In [18]:
%pwd

'/Users/kremerr/Documents/GitHub/RARR/notebooks'

In [21]:
try:
    persist_directory = '/Users/kremerr/Documents/GitHub/RARR/triviaqa_vecdb'
    # Embed every split and store the vectors in a persistent Chroma collection
    vectordb = Chroma.from_documents(
        documents=splits,
        embedding=embedding,
        persist_directory=persist_directory,
        collection_name="trivia_qa_collection"
    )
except openai.RateLimitError as e:
    # NOTE: this handler only pauses; nothing re-runs the call afterwards
    print(f"RateLimitError: {e}. Waiting before retrying...")
    wait_time = 10  # Default wait time of 10 seconds
    time.sleep(wait_time)

except openai.OpenAIError as e:
    print(f"OpenAIError: {e}.")
    raise

except UnicodeEncodeError as e:
    # Show a few characters on either side of the text that failed to encode
    start = max(e.start - 10, 0)
    end = min(e.end + 10, len(e.object))
    surrounding_text = e.object[start:end]
    print(f"UnicodeEncodeError: cannot encode text surrounding '{surrounding_text}' at position {e.start}-{e.end}")
    raise
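
One possible way to make rate-limit handling actually retry is to add documents in batches and wrap each batch in a retry loop with exponential backoff. The sketch below is self-contained and does not call OpenAI or Chroma: `TransientError` is a hypothetical stand-in for `openai.RateLimitError`, and `flaky_add` is a made-up function that simulates a store which fails on every other call. With the real vector store, the per-batch call would presumably be something like `vectordb.add_documents(batch)`.

```python
import time


class TransientError(Exception):
    """Hypothetical stand-in for a rate-limit error such as openai.RateLimitError."""


def call_with_retry(func, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func(); on TransientError wait base_delay * 2**(attempt - 1) and try again."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except TransientError:
            if attempt == max_attempts:
                raise  # Give up after the last allowed attempt
            sleep(base_delay * 2 ** (attempt - 1))


def batched(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# --- Demo with a simulated store that fails on every odd-numbered call ---
attempts = {"count": 0}


def flaky_add(batch, store):
    attempts["count"] += 1
    if attempts["count"] % 2 == 1:
        raise TransientError("simulated rate limit")
    store.extend(batch)


store = []
delays = []  # Record requested delays instead of really sleeping
for batch in batched(list(range(7)), batch_size=3):
    call_with_retry(lambda b=batch: flaky_add(b, store), sleep=delays.append)

print("stored items:", store)
print("requested delays:", delays)
```

Batching also means a failure part-way through loses only the current batch instead of the whole embedding run.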


In [None]:
print(vectordb._collection.count())

22747


In [20]:
question = "What is the attention mechanism in a transformer model?"
docs = vectordb.similarity_search(question, k=3)

In [21]:
for i in range(len(docs)):
    print(docs[i].page_content)
    print()

Figure 1: Multi-head attention & scaled dot product attention (Vaswani et al., 2017)
2.1 T RANSFORMER ARCHITECTURE
The transformer model was first proposed in 2017 for a machine translation task, and since then, numerous models have
been developed based on the inspiration of the original transformer model to address a variety of tasks across different fields.
While some models have utilized the vanilla transformer architecture as is, others have leveraged only the encoder or decoder
module of the transformer model. As a result, the task and performance of transformer-based models can vary depending on
the specific architecture employed. Nonetheless, a key and widely used component of transformer models is self-attention,
which is essential to their functionality. All transformer-based models employ the self-attention mechanism and multi-head
attention, which typically forms the primary learning layer of the architecture. Given the significance of self-attention, the
role of the attenti

In [22]:
docs = vectordb.max_marginal_relevance_search(question, k=2, fetch_k=3)

In [23]:
for i in range(len(docs)):
    print(docs[i].page_content)
    print()

Figure 1: Multi-head attention & scaled dot product attention (Vaswani et al., 2017)
2.1 T RANSFORMER ARCHITECTURE
The transformer model was first proposed in 2017 for a machine translation task, and since then, numerous models have
been developed based on the inspiration of the original transformer model to address a variety of tasks across different fields.
While some models have utilized the vanilla transformer architecture as is, others have leveraged only the encoder or decoder
module of the transformer model. As a result, the task and performance of transformer-based models can vary depending on
the specific architecture employed. Nonetheless, a key and widely used component of transformer models is self-attention,
which is essential to their functionality. All transformer-based models employ the self-attention mechanism and multi-head
attention, which typically forms the primary learning layer of the architecture. Given the significance of self-attention, the
role of the attenti
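
Maximal marginal relevance (MMR) first fetches `fetch_k` candidates by similarity and then picks `k` of them, trading relevance to the query against redundancy with what is already selected. The following is a minimal sketch of that selection step in plain Python, under the usual formulation `lambda_mult * relevance - (1 - lambda_mult) * redundancy`. It is not LangChain's implementation, and the vectors are made up for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def mmr_select(query_vec, candidate_vecs, k, lambda_mult=0.5):
    """Return indices of k candidates chosen greedily by the MMR score."""
    selected = []
    remaining = list(range(len(candidate_vecs)))
    while remaining and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for idx in remaining:
            relevance = cosine(query_vec, candidate_vecs[idx])
            # Similarity to the closest already-selected candidate (0 if none yet)
            redundancy = max(
                (cosine(candidate_vecs[idx], candidate_vecs[s]) for s in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = idx, score
        selected.append(best_idx)
        remaining.remove(best_idx)
    return selected


# Made-up 2-D vectors standing in for fetch_k=3 retrieved candidates
query = [1.0, 0.0]
candidates = [
    [0.9, 0.1],    # 0: very close to the query
    [0.9, 0.12],   # 1: near-duplicate of candidate 0
    [0.7, -0.7],   # 2: less relevant, but points in a different direction
]

print("balanced (lambda_mult=0.5):", mmr_select(query, candidates, k=2))
print("relevance only (lambda_mult=1.0):", mmr_select(query, candidates, k=2, lambda_mult=1.0))
```

In this toy setup the balanced setting skips the near-duplicate, while `lambda_mult=1.0` reduces to plain similarity ranking. With `fetch_k=3` and `k=2` as in the cell above, there are only three candidates to diversify over, so the top result will typically match the plain similarity search.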

In [24]:
vectordb.persist()

## Question Answering

In [25]:
# os.environ["LANGCHAIN_TRACING_V2"] = "true"
# os.environ["LANGCHAIN_ENDPOINT"] = "https://api.langchain.plus"
# os.environ["LANGCHAIN_API_KEY"] = "<your-langchain-api-key>"  # never commit a real key

In [58]:
question = "What was the first name of the 22nd president of the United States of America?" #"What is a good replacement for eggs in baking?"

llm = ChatOpenAI(model_name="gpt-3.5-turbo-0125", temperature=0)

prompt_template = """<human>: Answer the question based only on the following context. If you cannot answer the question with the context, please respond with 'I don't know':
### CONTEXT
{context}
### QUESTION
Question: {question}
\n
<bot>:
"""
QA_CHAIN_PROMPT = PromptTemplate.from_template(prompt_template)
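
To see what the model receives, it can help to fill the template by hand. To my understanding, the default "stuff" chain type joins the retrieved chunks (separated by blank lines) and substitutes them for `{context}`. The sketch below imitates that with plain string formatting. The helper `build_prompt` and the two placeholder chunks are hypothetical and do not come from the vector store.

```python
PROMPT_TEMPLATE = """<human>: Answer the question based only on the following context. If you cannot answer the question with the context, please respond with 'I don't know':
### CONTEXT
{context}
### QUESTION
Question: {question}

<bot>:
"""


def build_prompt(template, chunks, question, separator="\n\n"):
    """Join the retrieved chunks and substitute them into the template."""
    context = separator.join(chunks)
    return template.format(context=context, question=question)


# Hypothetical placeholder chunks; real ones would come from the retriever
hypothetical_chunks = [
    "Placeholder chunk A: text of the first retrieved passage.",
    "Placeholder chunk B: text of the second retrieved passage.",
]
question = "What was the first name of the 22nd president of the United States of America?"

filled = build_prompt(PROMPT_TEMPLATE, hypothetical_chunks, question)
print(filled)
```

Because the prompt tells the model to answer only from the context, an "I don't know" response usually means the retrieved chunks did not contain the answer. When the chain is built with `return_source_documents=True`, the retrieved chunks can be inspected to check this.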

In [59]:
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    return_source_documents=True,
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT}
)

result = qa_chain.invoke({"query": question})

In [60]:
result["result"]

"I don't know."