#### PyPDFLoader
PyPDFLoader provides functionality for loading PDF documents within the LangChain framework.

In [1]:
# !pip install pypdf
from langchain_community.document_loaders import PyPDFLoader

Let's first take a look at the PDF document.

In [2]:
!open documents/nihms-1828057.pdf

This line initializes the loader with the path to the PDF.

In [3]:
loader = PyPDFLoader("documents/nihms-1828057.pdf")

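Load the PDF using pypdf into the "pages" variable. 

Each page is stored as a separate chunk, and its page number is stored in the chunk's metadata. Note that load_and_split() also runs a default text splitter, so a very long page may produce more than one chunk.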

In [4]:
pages = loader.load_and_split()

In [5]:
pages[:3]

[Document(page_content='Pancreatic Cancer: A Review\nWungki Park, MD ,\nDepartment of Medicine, Memorial Sloan Kettering Cancer Center, New York, New York\nDavid M. Rubenstein Center for Pancreatic Cancer Research, New York, New York\nDepartment of Medicine, Weill Cornell Medical College, New York, New York\nParker Institute for Cancer Immunotherapy, San Francisco, California\nAkhil Chawla, MD ,\nDepartment of Surgery, Northwestern Medicine Regional Medical Group, Northwestern University \nFeinberg School of Medicine, Chicago, Illinois\nRobert H. Lurie Comprehensive Cancer Center, Chicago, Illinois\nEileen M. O’Reilly, MD\nDepartment of Medicine, Memorial Sloan Kettering Cancer Center, New York, New York\nDavid M. Rubenstein Center for Pancreatic Cancer Research, New York, New York\nDepartment of Medicine, Weill Cornell Medical College, New York, New York\nAbstract\nIMPORTANCE— Pancreatic ductal adenocarcinoma (PDAC) is a relatively uncommon cancer, \nwith approximately 60 430 new diag

In [6]:
for i in range(3):
    print(pages[i].metadata)

{'source': 'documents/nihms-1828057.pdf', 'page': 0}
{'source': 'documents/nihms-1828057.pdf', 'page': 1}
{'source': 'documents/nihms-1828057.pdf', 'page': 2}


Since each page of the PDF is still quite long, we break the pages into smaller pieces. 

We add a small overlap between consecutive chunks so that a sentence cut at a chunk boundary is less likely to be lost. 

In [7]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)

documents = text_splitter.split_documents(pages)
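The effect of chunk_size and chunk_overlap can be illustrated with a much simpler splitter. The sketch below is only an assumption-level illustration: `split_with_overlap` is a hypothetical helper that cuts fixed-size character windows, whereas RecursiveCharacterTextSplitter also tries to break on separators such as paragraphs and sentences.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut `text` into windows of `chunk_size` characters; each window
    repeats the last `chunk_overlap` characters of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
```

Each chunk starts with the last three characters of the previous one, which is the role the overlap plays for sentences near a chunk boundary.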

In [8]:
print(f"{len(pages)} vs {len(documents)}")

28 vs 88


Let's now load the OpenAI API key.

In [9]:
import os
from dotenv import load_dotenv

load_dotenv(".env")
openai_api_key = os.getenv("openai_api_key")

#### Embeddings:

We are going to use OpenAI embeddings to convert each chunk of text into a numeric vector. 

Remember, the reason is that searching through a large number of text chunks is very time consuming. 
However, numeric vector comparisons are extremely fast. 

In [10]:
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)
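To make "numeric vector comparison" concrete, here is a minimal sketch of cosine similarity, one common way to compare two embeddings. The three-dimensional vectors are made up for illustration (real embeddings have many more dimensions), and the vector store may use a different distance measure internally.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings", not produced by any real model.
question_vec = [1.0, 0.0, 1.0]
similar_chunk_vec = [2.0, 0.0, 2.0]    # points in the same direction
unrelated_chunk_vec = [0.0, 3.0, 0.0]  # orthogonal to the question

sim_similar = cosine_similarity(question_vec, similar_chunk_vec)
sim_unrelated = cosine_similarity(question_vec, unrelated_chunk_vec)
print(sim_similar, sim_unrelated)
```

A value near 1 means the two vectors point in the same direction, and a value near 0 means they are unrelated.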

#### Chroma vector database

We need to store all the numeric vectors in a database. 
One local database is Chroma. 

In [11]:
from langchain_community.vectorstores import Chroma
vector = Chroma.from_documents(documents, embeddings)

#### large language model

In [12]:
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(openai_api_key=openai_api_key)

#### output parser

We would like to convert the output of the chat model into plain text.

In [13]:
from langchain_core.output_parsers import StrOutputParser
output_parser = StrOutputParser()

#### retrievers

The retriever takes the question, compares it with all the numeric vectors in the database, and returns the most similar chunks of text.

In [14]:
retriever = vector.as_retriever()
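Conceptually, retrieval is "embed the question, score every stored chunk, keep the best k". The sketch below shows that idea with a toy word-count embedding; `toy_embed`, `similarity`, and `retrieve` are hypothetical names for illustration only, and the real retriever relies on the OpenAI embeddings and the Chroma index instead.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in embedding: a bag-of-words count (NOT a real embedding model)."""
    return Counter(text.lower().split())


def similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question, chunks, k=2):
    """Return the k chunks whose toy embedding is closest to the question."""
    q_vec = toy_embed(question)
    ranked = sorted(chunks, key=lambda c: similarity(q_vec, toy_embed(c)), reverse=True)
    return ranked[:k]


toy_chunks = [
    "the moon orbits the earth",
    "pancreatic cancer is often diagnosed late",
    "genetic testing is recommended for pancreatic cancer patients",
]
top = retrieve("how is pancreatic cancer diagnosed", toy_chunks, k=2)
print(top)
```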

## Adding Memory

#### Question maker

When a user asks a new question, they have the history of previous questions and answers in mind.

The idea here is to reformulate the user's question so that it carries its own context. 

We are going to use the LLM to perform this reformulation. 


Here is the idea:

User's followup question <b>=></b> LLM <b>=></b> reformulated question (with history)



In [16]:
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

instruction_to_system = """
Given a chat history and the latest user question 
which might reference context in the chat history, formulate a standalone question 
which can be understood without the chat history. Do NOT answer the question, 
just reformulate it if needed and otherwise return it as is.
"""

question_maker_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", instruction_to_system),
        MessagesPlaceholder(variable_name="chat_history"),
        ("human", "{question}"),
    ]
)


question_chain = question_maker_prompt | llm | StrOutputParser()

#### example

Let's assume that we had previously asked about the shape of the moon and the AI had responded that the moon is spherical. 

Now the user follows up asking for further explanation, but does not give the context. 

The question_chain has to add the context to the follow-up question and produce a new, standalone question. 

In [17]:
from langchain_core.messages import AIMessage, HumanMessage
question_chain.invoke({"question":"can you explain more?", 
                       "chat_history": [HumanMessage(content="you explained that the moon is round")]})

'Can you provide further details about the shape of the moon?'

#### Prompt

We now build the prompt for the question and answer. 

This time, the prompt is a Python list that consists of:
* a system instruction
* a placeholder that receives the chat history later on
* the user's question

In [18]:
# Use three sentences maximum and keep the answer concise.\
qa_system_prompt = """You are an assistant for question-answering tasks. \
Use the following pieces of retrieved context to answer the question. \
If you don't know the answer, provide a summary of the context. Do not generate your answer.\


{context}"""
qa_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", qa_system_prompt),
        MessagesPlaceholder(variable_name="chat_history"),
        ("human", "{question}"),
    ]
)
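As a rough mental model, the template expands into a flat list of messages at run time: the system text with {context} filled in, then every message from the chat history, then the new question. The sketch below mimics that with plain tuples; `build_messages` is a hypothetical helper and not how ChatPromptTemplate is implemented.

```python
def build_messages(system_template, chat_history, question, **variables):
    """Fill the system template, then splice the history before the question."""
    system_text = system_template.format(**variables)
    return [("system", system_text), *chat_history, ("human", question)]


messages = build_messages(
    "Use the following context to answer the question.\n\n{context}",
    chat_history=[
        ("human", "What is PDAC?"),
        ("ai", "Pancreatic ductal adenocarcinoma."),
    ],
    question="How common is it?",
    context="PDAC is a relatively uncommon cancer.",
)
for role, text in messages:
    print(role, "->", text)
```

With an empty chat history the list contains only the system message and the question.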

#### Which question to pass to LLM?

We define a function that looks at the chat history:
* if there is a history, it returns the question chain (which reformulates the user's question)
* if the chat history is empty, it returns the user's question directly

In [19]:
def contextualized_question(input: dict):
    if input.get("chat_history"):
        return question_chain
    else:
        return input["question"]
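Note that the function returns the chain itself rather than its output. As far as I understand LCEL, when a function used inside a chain returns a runnable, that runnable is then invoked with the same input. The sketch below imitates this behavior with plain Python callables; `fake_question_chain` and `resolve` are hypothetical stand-ins, not LangChain objects.

```python
def fake_question_chain(inputs):
    """Stand-in for question_chain; the real chain would call the LLM here."""
    n_past = len(inputs["chat_history"])
    return f"[reformulated using {n_past} past message(s)] {inputs['question']}"


def contextualized_question(inputs):
    """Same branching logic as in the cell above."""
    if inputs.get("chat_history"):
        return fake_question_chain
    return inputs["question"]


def resolve(step, inputs):
    """Run a step; if it returned a callable, call that with the same input."""
    result = step(inputs)
    return result(inputs) if callable(result) else result


no_history = resolve(contextualized_question, {"question": "what is PDAC?", "chat_history": []})
with_history = resolve(
    contextualized_question,
    {"question": "can you explain more?", "chat_history": ["earlier message"]},
)
print(no_history)
print(with_history)
```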

In [None]:
# def format_docs(docs):
#     return "\n\n".join(doc.page_content for doc in docs)

#### Retriever chain

We need a chain that passes the following to the LLM:
* context: the most relevant chunks of the PDF, fetched by the vector retriever 
* question: the reformulated question or the user's original question, depending on the history
* chat_history: a Python list of the previous chat messages

We use the assign function, which adds the context to whatever it receives as input and passes the result to the next link of the chain. 

In [21]:
from langchain_core.runnables import RunnablePassthrough
retriever_chain = RunnablePassthrough.assign(
        context=contextualized_question | retriever #| format_docs
    )
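The behavior of assign can be pictured as "copy the input dict and add the computed keys". The following is a minimal sketch of that idea in plain Python; this `assign` is a hypothetical re-implementation for illustration, not the code behind RunnablePassthrough.assign.

```python
def assign(**computed):
    """Return a step that keeps every input key and adds the computed ones."""
    def run(inputs):
        extra = {key: fn(inputs) for key, fn in computed.items()}
        return {**inputs, **extra}
    return run


# Stand-in for `contextualized_question | retriever`: returns fake chunks.
add_context = assign(
    context=lambda inputs: [f"chunk relevant to: {inputs['question']}"]
)

result = add_context({"question": "can you explain more?", "chat_history": []})
print(result)
```

The original "question" and "chat_history" keys pass through unchanged, and "context" appears next to them.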

#### example

Let's see the output of the <b>retriever_chain</b>

Look at the extra "context" variable that is added to the "question" and "chat_history" variables. 

This is what the "assign" function does.

In [22]:
retriever_chain.invoke({
    "chat_history":[HumanMessage(content="you explained that the moon is round")],
    "question": "can you explain more?"
    })

{'chat_history': [HumanMessage(content='you explained that the moon is round')],
 'question': 'can you explain more?',
 'context': [Document(page_content='Author Manuscript Author Manuscript Author Manuscript Author Manuscript', metadata={'page': 2, 'source': 'documents/nihms-1828057.pdf'}),
  Document(page_content='indeterminate liver lesions are likely to represent metastases and identify cancers that \nmay be poorly characterized on CT imaging. Positron emission tomography/CT using a \nfluorodeoxyglucose tracer is a functional imaging tool that evaluates glucose metabolism in \nthe tumor and can help distinguish benign from malignant lesions in the pancreas; however, \nit lacks the spatial resolution and can detect glucose uptake from infection, inflammation \nalso confounding interpretation.50 Positron emission tomography/CT is not considered a \nroutine staging tool.\nEndoscopic ultrasonography is used to visualize a pancreas mass directly, secure a definitive \ncytologic or histo

#### Retrieval-Augmented Generation (RAG) Chain: the main chain

This is the main chain that produces the final answer.

In [26]:

rag_chain = (
    retriever_chain
    | qa_prompt
    | llm
)

In [25]:
question = "what percentage of patients have pathogenic germline gene variants?"

In [27]:
chat_history = []

ai_msg = rag_chain.invoke({"question": question, "chat_history": chat_history})
chat_history.extend([HumanMessage(content=question), ai_msg])
ai_msg

AIMessage(content='About 3.8% to 9.7% of patients with pancreatic ductal adenocarcinoma (PDAC) have pathogenic germline gene variants that increase susceptibility to the disease. These variants are mostly found in DNA damage repair genes like BRCA2, BRCA1, and ATM. Additionally, there are other rare inheritable germline variants like those in mismatch repair deficiency genes associated with Lynch syndrome, which occur in about 1% of patients with PDAC.')

In [28]:
print(ai_msg.content)

About 3.8% to 9.7% of patients with pancreatic ductal adenocarcinoma (PDAC) have pathogenic germline gene variants that increase susceptibility to the disease. These variants are mostly found in DNA damage repair genes like BRCA2, BRCA1, and ATM. Additionally, there are other rare inheritable germline variants like those in mismatch repair deficiency genes associated with Lynch syndrome, which occur in about 1% of patients with PDAC.


In [29]:
question = "Can you explain more?"
ai_msg = rag_chain.invoke({"question": question, "chat_history": chat_history})
chat_history.extend([HumanMessage(content=question), ai_msg])
ai_msg

AIMessage(content='Pancreatic ductal adenocarcinoma (PDAC) is associated with pathogenic germline gene variants in approximately 3.8% to 9.7% of patients. These variants are typically found in genes involved in DNA damage repair, such as BRCA2, BRCA1, and ATM. BRCA2 variants are more commonly associated with an increased risk of PDAC compared to BRCA1 or ATM variants. In addition to these common variants, there are rare but therapeutically important inheritable germline variants in genes related to mismatch repair deficiency, like MLH1, MSH2, MSH6, and PMS2, which are part of Lynch syndrome and occur in about 1% of PDAC patients. This information highlights the role of genetic factors in predisposing individuals to pancreatic cancer and the importance of genetic testing for better management and treatment strategies.')
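
The invoke-then-extend pattern used in the last two cells can be wrapped in a small helper so that the history is always updated after each turn. This is only a sketch: `ask` and `stub_chain` are hypothetical names, the stub replaces rag_chain so the example runs without an API key, and messages are plain tuples rather than HumanMessage/AIMessage objects.

```python
def ask(chain, question, chat_history):
    """Call the chain, then record the question and the answer in the history."""
    answer = chain({"question": question, "chat_history": chat_history})
    chat_history.extend([("human", question), ("ai", answer)])
    return answer


def stub_chain(inputs):
    """Stand-in for rag_chain.invoke; numbers the turn from the history length."""
    turn = len(inputs["chat_history"]) // 2 + 1
    return f"answer #{turn} to: {inputs['question']}"


history = []
first = ask(stub_chain, "what percentage of patients have pathogenic germline gene variants?", history)
second = ask(stub_chain, "Can you explain more?", history)
print(first)
print(second)
print(len(history))
```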