# QA over PDF file

## Intro
* We will create a Q&A app that can answer questions about PDF files.
* We will use a Document Loader to load text in a format usable by an LLM, then build a retrieval-augmented generation (RAG) pipeline to answer questions, including citations from the source material.
* **We will use a basic approach for this project. You will see more advanced ways to solve the same problem in the next projects.**

In [9]:
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())
groq_api_key = os.environ["GROQ_API_KEY"]
google_api_key = os.environ['GOOGLE_API_KEY']

In [10]:
# Load our chat completion model
from langchain_groq import ChatGroq

llm = ChatGroq(model="mixtral-8x7b-32768")

## Load the PDF file
* The loader reads the PDF at the specified path into memory.
* It then extracts text data using the pypdf package.
* Finally, it creates a LangChain Document for each page of the PDF with the page's content and some metadata about where in the document the text came from.

If you are using the pre-loaded poetry shell, you do not need to install anything for this step: the packages it relies on (including pypdf) are already pre-loaded for you.

In [11]:
from langchain_community.document_loaders import PyPDFLoader

file_path = "./data/Be_Good.pdf"

loader = PyPDFLoader(file_path)

docs = loader.load()

print(len(docs))

11


In [12]:
print(docs[0].page_content[0:100])
print(docs[0].metadata)

Be Good - Essay by Paul Graham
Be Good
Be good
April 2008(This essay is derived from a talk at the 2
{'source': './data/Be_Good.pdf', 'page': 0}


## RAG
* We will use the Chroma DB vector database (a.k.a. vector store).
* Using a text splitter, we will split the loaded PDF into smaller documents that fit more easily into an LLM's context window, then load them into the vector store.
* We can then create a retriever from the vector store for use in our RAG chain.

If you are using the pre-loaded poetry shell, you do not need to install the packages imported below because they are already pre-loaded for you.

In [14]:
from langchain_chroma import Chroma
#from langchain_openai import OpenAIEmbeddings
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

splits = text_splitter.split_documents(docs)

vectorstore = Chroma.from_documents(documents=splits, embedding=GoogleGenerativeAIEmbeddings(model="models/embedding-001"))

retriever = vectorstore.as_retriever()
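
The splitter above is configured with `chunk_size=1000` and `chunk_overlap=200`, both measured in characters by default. To make the effect of the overlap concrete, here is a minimal sketch in plain Python. It is a deliberately simplified fixed-width splitter, not the algorithm RecursiveCharacterTextSplitter actually uses (which also tries to break on separators such as paragraphs and lines). The function name `split_with_overlap` and the sample string are made up for illustration.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Simplified illustration: fixed-width chunks that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last four characters of the previous one. In the real pipeline this overlap reduces the chance that a sentence relevant to a question is cut in half at a chunk boundary.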

#### We will use two pre-defined chains to construct the final rag_chain
In this exercise we combine two pre-defined chains to build the final chain:
* create_stuff_documents_chain
* create_retrieval_chain

Let's learn a little bit more about these two pre-defined chains.

#### create_stuff_documents_chain
The create_stuff_documents_chain takes a list of documents, formats them all into a single prompt, and passes that prompt to an LLM. It passes ALL documents, so you should make sure they fit within the context window of the LLM you are using.
1. **Taking a List of Documents**: The chain starts by receiving the group of documents that you provide.

2. **Formatting into a Prompt**: It then inserts the text of all these documents into a prompt template. A prompt is the text that is fed into a large language model (LLM).

3. **Passing to an LLM**: After formatting the documents into a prompt, the chain sends that prompt to the language model. The model processes this information to perform tasks like answering questions, generating text, etc.

4. **Fit within Context Window**: The chain sends all the documents to the LLM at once, so the total length of the prompt must not exceed what the LLM can handle in a single call. This limit is known as the "context window" of the LLM. If the prompt is too long, the model might not process it effectively.

In simpler terms, think of this chain as a way of taking several pieces of text, bundling them together in a specific way, and then feeding them to an LLM that reads and uses this bundled text to do its job. Just make sure the bundle isn’t too big for the LLM to handle at once!
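
The "bundling" described above can be sketched without any LangChain code. This is only an illustration of the idea, not the library's implementation: the `Doc` class is a hypothetical stand-in for a LangChain Document, and `stuff_documents`, `demo_template`, and `MAX_PROMPT_CHARS` are names invented for this example. The character budget is a crude proxy, since real context windows are counted in tokens.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def stuff_documents(docs, template, question, separator="\n\n"):
    """Join the text of ALL documents and insert it into the prompt template."""
    context = separator.join(d.page_content for d in docs)
    return template.format(context=context, input=question)


demo_template = "Answer using only this context:\n\n{context}\n\nQuestion: {input}"
demo_docs = [
    Doc("Be good.", {"page": 0}),
    Doc("Make something people want.", {"page": 1}),
]

stuffed_prompt = stuff_documents(demo_docs, demo_template, "What is the essay about?")
print(stuffed_prompt)

# Hypothetical budget: a rough guard against overflowing the context window.
MAX_PROMPT_CHARS = 8_000
if len(stuffed_prompt) > MAX_PROMPT_CHARS:
    raise ValueError("Stuffed prompt is too long; retrieve fewer or smaller chunks.")
```

Because every document is included, the prompt grows linearly with the number and size of the chunks, which is why the size check matters.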


#### create_retrieval_chain
The create_retrieval_chain takes in a user inquiry, which is then passed to the retriever to fetch relevant documents. Those documents (and original inputs) are then passed to an LLM to generate a response.
1. **Receiving a User Inquiry**: This process begins when a user asks a question or makes a request.

2. **Using a Retriever to Fetch Documents**: The function then uses a retriever to find documents that are relevant to the user's inquiry. This means it searches through available information to pick out parts that can help answer the question.

3. **Passing Information to an LLM**: After gathering the relevant documents, both these documents and the original user inquiry are sent to an LLM.

4. **Generating a Response**: The LLM processes all the information it receives to come up with an appropriate response, which is then given back to the user.

In simpler terms, this chain acts like a smart assistant that first looks up information based on your question, gathers useful details, and then uses those details along with your original question to craft a helpful answer.
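
The four steps above can be sketched as a small plain-Python function. This is a toy model under stated assumptions, not how LangChain implements create_retrieval_chain: `make_retrieval_chain`, `toy_retrieve`, and `toy_answer` are hypothetical names, the retriever is a naive word-overlap ranking instead of a vector search, and the "LLM" is a function that just reports what it received. The only detail taken from the real chain is the shape of the result, a dictionary with `input`, `context`, and `answer` keys.

```python
from types import SimpleNamespace


def make_retrieval_chain(retrieve, answer_from_docs):
    """Toy chain: fetch documents for the question, then answer from them."""
    def invoke(inputs: dict) -> dict:
        context = retrieve(inputs["input"])                       # step 2: fetch documents
        answer = answer_from_docs({**inputs, "context": context})  # steps 3 and 4
        return {**inputs, "context": context, "answer": answer}
    return invoke


# Hypothetical two-document corpus standing in for the vector store.
toy_corpus = [
    SimpleNamespace(page_content="Be good and growth follows.", metadata={"page": 0}),
    SimpleNamespace(page_content="Investors fund startups.", metadata={"page": 1}),
]


def toy_retrieve(query: str, k: int = 1):
    """Rank documents by the number of lowercase words shared with the query."""
    query_words = set(query.lower().split())

    def score(doc):
        return len(query_words & set(doc.page_content.lower().split()))

    return sorted(toy_corpus, key=score, reverse=True)[:k]


def toy_answer(inputs: dict) -> str:
    """Stand-in for the LLM call: reports what it was given."""
    return f"Answered {inputs['input']!r} from {len(inputs['context'])} document(s)."


toy_chain = make_retrieval_chain(toy_retrieve, toy_answer)
toy_result = toy_chain({"input": "growth"})
print(toy_result["answer"])
```

Keeping the retrieved documents in the result, rather than returning only the answer, is what makes citations possible later.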

In [15]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "{context}"
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

question_answer_chain = create_stuff_documents_chain(llm, prompt)

rag_chain = create_retrieval_chain(retriever, question_answer_chain)

results = rag_chain.invoke({"input": "What is this article about?"})

results["answer"]

'This article by Paul Graham is about the importance of being good in business, not just to have a positive image but also as a strategy for success. He suggests that being good can lead to genuine growth and help a company maintain a competitive edge. He also discusses the concept of Microsoft\'s "don\'t be evil" approach as a potential elixir of corporate youth. The article is not a sanctimonious call to be good, but rather a pragmatic argument for its effectiveness.'

* If you print the whole `results` you will see that **you get both the answer, and the context the LLM used to generate that answer**. See it below:

In [16]:
results

{'input': 'What is this article about?',
 'context': [Document(metadata={'page': 10, 'source': './data/Be_Good.pdf'}, page_content="Be Good - Essay by Paul Graham\nGoogle does.Most explicitly benevolent projects don't hold themselves sufficiently\naccountable.  They act as if having good intentions were enough to\nguarantee good effects.[3] Users dislike their\nnew operating system so much that they're starting petitions to\nsave the old one.  And the old one was nothing special.  The hackers\nwithin Microsoft must know in their hearts that if the company\nreally cared about users they'd just advise them to switch to OSX.Thanks to Trevor Blackwell, Paul\nBuchheit, Jessica Livingston,\nand Robert Morris for reading drafts of this.\nPage 11"),
  Document(metadata={'page': 9, 'source': './data/Be_Good.pdf'}, page_content="You can't be buying users; that's a pyramid scheme.   But a company\nwith rapid, genuine growth is valuable, and eventually markets learn\nhow to value valuable things.[

* Examining the values under the context further, you can see that they are documents that each contain a chunk of the ingested page content. These documents also preserve the original **metadata** from way back when you first loaded them:

In [17]:
print(results["context"][0].metadata)

{'page': 10, 'source': './data/Be_Good.pdf'}


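* This particular chunk came from page 10 in the original PDF. Page numbers in the metadata are zero-based, so this is the eleventh page, which matches the "Page 11" footer in the chunk's text. You can use this data to show which page in the PDF the answer came from, allowing users to quickly verify that answers are based on the source material.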
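
One possible way to turn that metadata into citations is sketched below. The helper names `cited_pages` and `format_answer_with_sources` are hypothetical, and `demo_result` is a hand-built stand-in shaped like the `results` dictionary above (with a placeholder answer) so that the example runs on its own. It assumes the `page` value is zero-based, as the loader output in this notebook suggests, and converts it to the page number a reader would see.

```python
from types import SimpleNamespace


def cited_pages(context_docs):
    """Return sorted, de-duplicated, 1-based page numbers from the chunk metadata."""
    return sorted({doc.metadata["page"] + 1 for doc in context_docs})


def format_answer_with_sources(result: dict) -> str:
    """Append a 'Sources' line listing the pages the context chunks came from."""
    pages = ", ".join(str(page) for page in cited_pages(result["context"]))
    return f"{result['answer']}\n\nSources: pages {pages}"


# Hand-built stand-in for `results`; the answer text is a placeholder.
demo_result = {
    "input": "What is this article about?",
    "answer": "(answer text)",
    "context": [
        SimpleNamespace(metadata={"page": 10, "source": "./data/Be_Good.pdf"}),
        SimpleNamespace(metadata={"page": 9, "source": "./data/Be_Good.pdf"}),
        SimpleNamespace(metadata={"page": 10, "source": "./data/Be_Good.pdf"}),
    ],
}

print(format_answer_with_sources(demo_result))
```

With the real chain you would pass `results` instead of `demo_result`; several chunks from the same page collapse into a single citation.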