source: https://langchain.readthedocs.io/en/latest/modules/document_loaders/examples/pdf.html#

In [1]:
%pip install faiss-cpu pymupdf chromadb -q

# %pip install "unstructured[local-inference]" -q
# %pip install "detectron2@git+https://github.com/facebookresearch/detectron2.git@v0.6#egg=detectron2" -q
# !brew install poppler
# Even after the commented-out installs above, errors remained.

Note: you may need to restart the kernel to use updated packages.


## PyPDF

In [9]:
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("pdfs/09_The Normal (mu, sigma) model.pdf")
pages = loader.load_and_split()

print("pages", len(pages))
print("page 1 \n", pages[0].page_content)

pages 33
page 1 
 Bayesian Data Analysis
Module 3: Models with more than one parameter
Stat 474/574


In [13]:
from langchain.vectorstores import FAISS
from langchain.embeddings.openai import OpenAIEmbeddings

faiss_index = FAISS.from_documents(pages, OpenAIEmbeddings())
docs = faiss_index.similarity_search("What is conjugate prior for mean?", k=2)
for doc in docs:
    print(f"\n page {doc.metadata['page']} \n {doc.page_content}")


 page 25 
 Conjugate prior for the normal model (cont'd)
Interpretation of posterior parameters:
As before,nis a weighted average of the prior mean and the sample
mean.
The posterior \guess" n2
nis the sum of the sample sum of squared
deviations, the prior sum of squared deviations, and additional
uncertainty due to the dierence between the sample mean and the
prior mean.
Stat 474/574 (ISU) Spring, 2023 26 / 33

 page 21 
 Conjugate prior for the normal model
Recall that using a non-informative prior, we found that
p(j2;y)/N(y;2=n)
p(2jy)/Inv 2(n 1;s2):
Then, factoring p(;2) =p(j2)p(2) the conjugate prior for 2
would also be scaled inverse 2and for(conditional on 2) would
be normal.
Stat 474/574 (ISU) Spring, 2023 22 / 33
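
For intuition about what `similarity_search(..., k=2)` returns, here is a minimal sketch of top-k nearest-neighbor retrieval over toy 2-D vectors. It is an illustration only, not how FAISS is implemented; the vectors, the `top_k` helper, and the choice of Euclidean distance are all assumptions made for this example.

```python
import math


def top_k(query, vectors, k=2):
    """Return the indices of the k vectors closest to query (Euclidean distance)."""
    distances = [(math.dist(query, vec), idx) for idx, vec in enumerate(vectors)]
    return [idx for _, idx in sorted(distances)[:k]]


# Toy "embeddings" standing in for page vectors (hypothetical values).
page_vectors = [(0.0, 1.0), (0.9, 0.1), (1.0, 0.0), (-1.0, 0.0)]
nearest = top_k((1.0, 0.05), page_vectors, k=2)
print(nearest)
```

In the real index each page is a high-dimensional embedding, but the idea is the same: rank stored vectors by distance to the query vector and keep the closest `k`.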


## PyMuPDF

In [29]:
from langchain.document_loaders import PyMuPDFLoader
loader = PyMuPDFLoader("pdfs/09_The Normal (mu, sigma) model.pdf")

data = loader.load()

# Index 25 is the page numbered 26 / 33 in the slide footer.
print(data[25].page_content)

Conjugate prior for the normal model (cont’d)
Interpretation of posterior parameters:
As before, µn is a weighted average of the prior mean and the sample
mean.
The posterior “guess” νnσ2
n is the sum of the sample sum of squared
deviations, the prior sum of squared deviations, and additional
uncertainty due to the diﬀerence between the sample mean and the
prior mean.
Stat 474/574 (ISU)
Spring, 2023
26 / 33
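
The PyMuPDF text keeps typographic ligatures, such as the single-character `ﬀ` in "diﬀerence", which can make plain string matching miss words. One possible cleanup, sketched here with Python's standard `unicodedata` module, is NFKC normalization. The `normalize_text` helper is hypothetical, and whether this step improves embedding-based search is not tested in this notebook.

```python
import unicodedata


def normalize_text(text):
    """Apply NFKC normalization, which expands compatibility ligatures."""
    return unicodedata.normalize("NFKC", text)


# "\ufb00" is the single-character "ff" ligature seen in the extracted text.
raw = "uncertainty due to the di\ufb00erence between the sample mean"
clean = normalize_text(raw)
print("difference" in raw, "difference" in clean)
```

If used, the helper could be applied to each `page_content` before splitting or embedding.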



## VectorDB QA

https://langchain.readthedocs.io/en/latest/modules/chat/examples/vector_db_qa_with_sources.html

In [38]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma

In [39]:
from langchain.document_loaders import PyMuPDFLoader
loader = PyMuPDFLoader("pdfs/09_The Normal (mu, sigma) model.pdf")

documents = loader.load()

In [40]:
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
documents = text_splitter.split_documents(documents)
embeddings = OpenAIEmbeddings()
docsearch = Chroma.from_documents(documents, embeddings)

Running Chroma using direct local API.
Using DuckDB in-memory for database. Data will be transient.
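
To make the `chunk_size=1000, chunk_overlap=0` settings concrete, here is a rough sketch of separator-based splitting with no overlap: the text is cut on a separator and the pieces are greedily merged until the next piece would exceed the size limit. The `split_text` helper is hypothetical and only approximates the idea; it is not LangChain's implementation.

```python
def split_text(text, separator="\n\n", chunk_size=1000):
    """Greedily merge separator-delimited pieces into chunks.

    A chunk is closed when adding the next piece would exceed chunk_size.
    A single piece longer than chunk_size is kept whole.
    """
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa\n\nbbbb\n\ncccc"
chunks = split_text(sample, chunk_size=10)
print(chunks)
```

With a limit of 10 characters, the first two pieces fit together exactly and the third starts a new chunk.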


In [44]:
from langchain.chat_models import ChatOpenAI
from langchain.prompts.chat import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
    SystemMessagePromptTemplate,
)

system_template = """Use the following pieces of context to answer the user's question.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
----------------
{context}"""
messages = [
    SystemMessagePromptTemplate.from_template(system_template),
    HumanMessagePromptTemplate.from_template("{question}")
]
prompt = ChatPromptTemplate.from_messages(messages)

In [46]:
from langchain.chains import VectorDBQA

chain_type_kwargs = {"prompt": prompt}
qa = VectorDBQA.from_chain_type(
    llm=ChatOpenAI(),
    chain_type="stuff",
    vectorstore=docsearch,
    chain_type_kwargs=chain_type_kwargs,
)
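
The `"stuff"` chain type places all retrieved chunks into the `{context}` slot of the prompt and sends them in a single call. A minimal sketch of that idea with plain string formatting follows. The `build_prompt` helper, the shortened template, and the sample chunks are hypothetical, and the real chain's joining and message handling may differ.

```python
SYSTEM_TEMPLATE = (
    "Use the following pieces of context to answer the user's question.\n"
    "----------------\n"
    "{context}"
)


def build_prompt(chunks, question):
    """Join retrieved chunks into one context block and pair it with the question."""
    context = "\n\n".join(chunks)
    return [
        ("system", SYSTEM_TEMPLATE.format(context=context)),
        ("human", question),
    ]


messages = build_prompt(
    ["Chunk about the conjugate prior.", "Chunk about the posterior."],
    "What is the conjugate prior for the mean?",
)
for role, content in messages:
    print(f"[{role}]\n{content}\n")
```

Because every chunk goes into one prompt, this approach only works while the combined context fits in the model's context window.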

In [56]:
res = qa.run("what is joint posterior distribution, conditional posterior, and marginal posterior of normal? Provide the notations and an English translation")

In [69]:
import textwrap


def wrap_text(text, width=120):
    """Wrap long text to the given width so the answer is easier to read."""
    return "\n".join(textwrap.wrap(text, width=width))

print(wrap_text(res))

The joint posterior distribution of normal in this context is denoted as p(µ, σ2|y), which represents the probability
distribution of the two parameters (mean and variance) after observing the data y. This distribution can be obtained by
multiplying the prior distribution of the two parameters with the likelihood function of the data, and is used to make
inference about both parameters simultaneously.  The conditional posterior of normal, specifically of the mean parameter
µ given the variance parameter σ2 and the data y, is denoted as µ|σ2, y. This represents the probability distribution of
the mean parameter after observing the data y, conditional on the value of the variance parameter σ2. This distribution
can be obtained from the joint posterior by using Bayes' theorem and isolating the mean parameter as the variable of
interest.  The marginal posterior of normal is denoted as p(θ1|y), and represents the probability distribution of the
variable of interest (in this case, the mean p

## Multiple documents

In [None]:
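import os

from langchain.document_loaders import PyMuPDFLoader

# Build one loader per PDF, skipping any non-PDF files in the folder.
pdf_dir = "pdfs"
loaders = [
    PyMuPDFLoader(os.path.join(pdf_dir, name))
    for name in sorted(os.listdir(pdf_dir))
    if name.lower().endswith(".pdf")
]

# Collect the pages from every PDF into a single list.
docs = []
for loader in loaders:
    docs.extend(loader.load())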

In [None]:
print(docs)