# QA over PDF file

## Intro
* We will create a Q&A app that can answer questions about PDF files.
* We will use a Document Loader to load text in a format usable by an LLM, then build a retrieval-augmented generation (RAG) pipeline to answer questions, including citations from the source material.

## Setup

#### Recommended: create a new virtualenv
* mkdir your_project_name
* cd your_project_name
* pyenv virtualenv 3.11.4 your_venv_name
* pyenv activate your_venv_name
* pip install jupyterlab
* jupyter lab

In [1]:
#!pip install python-dotenv

#### .env File
Remember to include:
OPENAI_API_KEY=your_openai_api_key

LANGCHAIN_TRACING_V2=true
LANGCHAIN_ENDPOINT=https://api.smith.langchain.com
LANGCHAIN_API_KEY=your_langchain_api_key
LANGCHAIN_PROJECT=your_project_name

We will call our LangSmith project **qaFromPDF**.

In [1]:
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())
openai_api_key = os.environ["OPENAI_API_KEY"]
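
* `os.environ["OPENAI_API_KEY"]` raises a `KeyError` if the variable is missing, so it can help to check the whole `.env` listing up front. The sketch below is one way to do that. The helper `missing_env_vars` is hypothetical (not part of python-dotenv or LangChain), and the demo uses a plain dict with placeholder values in place of `os.environ`.

```python
import os

# Variable names taken from the .env listing above.
REQUIRED_VARS = [
    "OPENAI_API_KEY",
    "LANGCHAIN_TRACING_V2",
    "LANGCHAIN_ENDPOINT",
    "LANGCHAIN_API_KEY",
    "LANGCHAIN_PROJECT",
]


def missing_env_vars(required, env=None):
    """Return the names in `required` that are unset or empty in `env`.

    Hypothetical helper; `env` defaults to the real process environment.
    """
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Demo: a plain dict with placeholder values stands in for os.environ.
example_env = {"OPENAI_API_KEY": "sk-placeholder", "LANGCHAIN_TRACING_V2": "true"}
missing = missing_env_vars(REQUIRED_VARS, example_env)
print("Missing:", missing)
```

* In the notebook itself you would call `missing_env_vars(REQUIRED_VARS)` after `load_dotenv(...)` and stop if the returned list is not empty.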

#### Install LangChain

In [3]:
#!pip install langchain

## Connect with an LLM

In [4]:
#!pip install langchain-openai

* NOTE: We use OpenAI by default because, at the time of writing, we consider it the strongest LLM on the market. A later lesson shows how to connect to open-source LLMs such as Llama3 or Mistral.

In [2]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-3.5-turbo-0125")

## Load the PDF file
* The loader reads the PDF at the specified path into memory.
* It then extracts text data using the pypdf package.
* Finally, it creates a LangChain Document for each page of the PDF with the page's content and some metadata about where in the document the text came from.

In [None]:
#!pip install langchain-community

In [None]:
#!pip install pypdf

In [4]:
from langchain_community.document_loaders import PyPDFLoader

file_path = "./data/Be_Good.pdf"

loader = PyPDFLoader(file_path)

docs = loader.load()

print(len(docs))

11


In [5]:
print(docs[0].page_content[0:100])
print(docs[0].metadata)

Be Good - Essay by Paul Graham
Be Good
Be good
April 2008(This essay is derived from a talk at the 2
{'source': './data/Be_Good.pdf', 'page': 0}
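
* The one-document-per-page structure shown in the output above can be pictured with a small stand-in. This is only a sketch of the shape of the data, not the loader's implementation: `PageDocument` and `pages_to_documents` are hypothetical names, and the page texts and file path are placeholders rather than real extracted content.

```python
from dataclasses import dataclass, field


@dataclass
class PageDocument:
    """Hypothetical stand-in for a LangChain Document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def pages_to_documents(page_texts, source):
    """Wrap each page's text in a document that records the source path
    and the 0-based page index, mirroring the metadata printed above."""
    return [
        PageDocument(page_content=text, metadata={"source": source, "page": index})
        for index, text in enumerate(page_texts)
    ]


# Placeholder page texts; a real loader would extract these from the PDF.
demo_docs = pages_to_documents(["first page text", "second page text"], "./data/example.pdf")
print(len(demo_docs))
print(demo_docs[1].metadata)
```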


## RAG
* We will use Chroma DB as our vector database (a.k.a. vector store).
* Using a text splitter, we will split the loaded PDF into smaller documents that can more easily fit into an LLM's context window, then load them into a vector store.
* We can then create a retriever from the vector store for use in our RAG chain:

In [None]:
#!pip install langchain-chroma

In [6]:
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

splits = text_splitter.split_documents(docs)

vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())

retriever = vectorstore.as_retriever()
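
* To see what `chunk_size` and `chunk_overlap` mean, here is a deliberately simplified fixed-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraph breaks and newlines, so its chunks will differ). The function name `split_with_overlap` and the tiny sizes are assumptions chosen for illustration.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut `text` into windows of at most `chunk_size` characters, each
    starting `chunk_size - chunk_overlap` characters after the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

* Each chunk repeats the last 4 characters of the previous one. The overlap reduces the chance that a sentence relevant to a question is cut in half at a chunk boundary.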

* We will then use some built-in helpers to construct the final rag_chain:

In [8]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "{context}"
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

question_answer_chain = create_stuff_documents_chain(llm, prompt)

rag_chain = create_retrieval_chain(retriever, question_answer_chain)

results = rag_chain.invoke({"input": "What is this article about?"})

results["answer"]

'The essay "Be Good" by Paul Graham discusses the importance of creating something that people want and not worrying about making money initially when starting a business. Graham explores the idea of businesses being similar to charities and the advantages of focusing on user satisfaction. The essay also mentions examples like Google and Craigslist to illustrate these concepts.'
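
* Conceptually, the "stuff" approach places all retrieved chunks into the `{context}` slot of the system prompt before the model is called. The sketch below approximates that step in plain Python. It is not LangChain's implementation: `stuff_prompt` is a hypothetical helper, the blank-line separator is an assumption, and the template and chunks are shortened for illustration.

```python
def stuff_prompt(system_template, chunks, question, separator="\n\n"):
    """Join the retrieved chunks and substitute them for {context}
    in the system prompt; return (role, text) message pairs."""
    context = separator.join(chunks)
    return [
        ("system", system_template.format(context=context)),
        ("human", question),
    ]


# Shortened template and two short chunks, for illustration only.
template = "Use the following pieces of retrieved context to answer the question.\n\n{context}"
retrieved = ["Make something people want.", "Don't worry too much about making money."]
messages = stuff_prompt(template, retrieved, "What is this article about?")
print(messages[0][1])
```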

* Printing the whole `results` dict shows both the final answer, under the `answer` key, and the retrieved documents the LLM used to generate it, under the `context` key:

In [10]:
results

{'input': 'What is this article about?',
 'context': [Document(page_content="Be Good - Essay by Paul Graham\nBe Good\nBe good\nApril 2008(This essay is derived from a talk at the 2008 Startup School.)About a month after we\nstarted Y Combinator we came up with the\nphrase that became our motto: Make something people want.  We've\nlearned a lot since then, but if I were choosing now that's still\nthe one I'd pick.Another thing we tell founders is not to worry too much about the\nbusiness model, at least at first.  Not because making money is\nunimportant, but because it's so much easier than building something\ngreat.A couple weeks ago I realized that if you put those two ideas\ntogether, you get something surprising.  Make something people want.\nDon't worry too much about making money.  What you've got is a\ndescription of a charity.When you get an unexpected result like this, it could either be a\nbug or a new discovery.  Either businesses aren't supposed to be\nlike charities, and w

* Looking more closely at the values under `context`, you can see that each one is a document containing a chunk of the ingested page content. These documents also preserve the metadata attached when the PDF was first loaded:

In [9]:
print(results["context"][0].metadata)

{'page': 0, 'source': './data/Be_Good.pdf'}


* This particular chunk came from page 0 in the original PDF. You can use this data to show which page in the PDF the answer came from, allowing users to quickly verify that answers are based on the source material.
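
* One way to turn that metadata into user-facing citations is sketched below. `format_citations` is a hypothetical helper, and the sample list is made up (only the first entry matches the output above). Because the `page` value is 0-based, the sketch adds 1 for display.

```python
def format_citations(metadatas):
    """Build an ordered, de-duplicated list of 'source, page N' strings
    from the metadata dicts of the retrieved chunks."""
    seen = set()
    citations = []
    for meta in metadatas:
        key = (meta["source"], meta["page"])
        if key not in seen:
            seen.add(key)
            # The stored page index is 0-based; show a 1-based page number.
            citations.append(f"{meta['source']}, page {meta['page'] + 1}")
    return citations


# Made-up sample; several chunks often come from the same page.
sample_metadata = [
    {"page": 0, "source": "./data/Be_Good.pdf"},
    {"page": 3, "source": "./data/Be_Good.pdf"},
    {"page": 0, "source": "./data/Be_Good.pdf"},
]
citations = format_citations(sample_metadata)
for citation in citations:
    print(citation)
```

* With the real chain output, the input would be `[doc.metadata for doc in results["context"]]`.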