This notebook is an alternative to the approach in the "Query a pdf document - FRTB-SA regulation" notebook saved in the same folder.

This version uses the langchain library and a Pinecone vector store.

In [1]:
import os
import pinecone
from langchain.document_loaders import UnstructuredPDFLoader, OnlinePDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma, Pinecone
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain



In [2]:
PDF_DOC = "data/d352.pdf"

This analysis is based on similar approaches I found online, e.g.
https://bennycheung.github.io/ask-a-book-questions-with-langchain-openai

### Extract the Book Content

In [3]:
loader = UnstructuredPDFLoader(PDF_DOC)
data = loader.load()

detectron2 is not installed. Cannot use the hi_res partitioning strategy. Falling back to partitioning with the fast strategy.


In [4]:
print(f'You have {len(data)} document(s) in your data')
print(f'There are {len(data[0].page_content):,} characters in your document')

You have 1 document(s) in your data
There are 297,193 characters in your document


### Split Book into Smaller Chunks 

We split the loaded PDF document into smaller chunks of at most 1,000 characters each, with a 100-character overlap between consecutive chunks so that a sentence cut at a boundary still appears whole in one of them. We do this for two reasons. First, OpenAI embeddings work best with shorter pieces of text. Second, when we ask a question we can pass OpenAI only the few chunks that are relevant to it as context, which is more efficient and cost-effective than making it read the entire book every time.

In [21]:
text_splitter = RecursiveCharacterTextSplitter(
  chunk_size=1000, chunk_overlap=100)
texts = text_splitter.split_documents(data)

print(f'Now you have {len(texts)} documents')

Now you have 381 documents
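
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-window splitting with overlap. It is a simplification and not how `RecursiveCharacterTextSplitter` works internally: the helper name `split_with_overlap` is hypothetical, and it cuts at fixed character offsets rather than at separators.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int = 1000,
                       chunk_overlap: int = 100) -> List[str]:
    """Cut text into fixed-size windows that share chunk_overlap characters.

    Hypothetical helper for illustration only; it ignores separators.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# Synthetic 2,500-character text, used only to show the window arithmetic.
sample = "".join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(sample, chunk_size=1000, chunk_overlap=100)
print(len(chunks), [len(c) for c in chunks])
```

On a 297,193-character text this fixed-window scheme would produce 331 chunks. The 381 reported above is plausibly higher because the recursive splitter prefers to cut at separators such as paragraph breaks, which leaves many chunks shorter than the maximum.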


### Build Semantic Index

We create embeddings of our chunks to get ready for semantic search. We store these vectors online in a Pinecone vector store so that we can add more books to our corpus without re-reading the PDFs each time. We also assign the chunks to a "book" namespace in the index.

You need to create your own Pinecone API key, environment and index before running the cells below.
An example of how to create an index is here: https://www.youtube.com/watch?v=h0DHDp1FbmQ
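
As a rough mental model of what a namespaced vector index does, the sketch below keeps vectors in memory and ranks them by cosine similarity. Everything in it is an assumption for illustration: the class `ToyVectorIndex`, its methods and the hand-made 3-dimensional vectors are hypothetical and are not part of the Pinecone client, and the similarity metric of a real Pinecone index depends on how the index was created.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorIndex:
    """Hypothetical in-memory stand-in for a vector store with namespaces."""

    def __init__(self):
        self._store = defaultdict(list)  # namespace -> [(vector, text), ...]

    def upsert(self, namespace, vector, text):
        self._store[namespace].append((vector, text))

    def query(self, namespace, vector, k=4):
        """Return up to k (score, text) pairs from one namespace, best first."""
        scored = [(cosine(vector, v), t) for v, t in self._store.get(namespace, [])]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


index = ToyVectorIndex()
# Hand-made vectors standing in for real embeddings.
index.upsert("book", [1.0, 0.0, 0.0], "trading desk definition")
index.upsert("book", [0.0, 1.0, 0.0], "curvature risk charge")
index.upsert("other", [1.0, 0.0, 0.0], "text from an unrelated book")

hits = index.query("book", [0.9, 0.1, 0.0], k=1)
print(hits[0][1])
```

The query only looks inside the "book" namespace, so the entry stored under "other" is never considered even though its vector matches best. This is the reason for giving each book its own namespace when the corpus grows.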

In [22]:
embeddings = OpenAIEmbeddings(openai_api_key=os.environ["OPENAI_API_KEY"])

In [23]:
# initialize pinecone
# the user must have created their own pinecone key & env
# this is an example of how to do this: https://www.youtube.com/watch?v=h0DHDp1FbmQ
pinecone.init(
    api_key=os.environ["PINECONE_API_KEY"],  
    environment=os.environ["PINECONE_API_ENV"] 
)
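
The cells above read three environment variables and fail with a `KeyError` if one is missing. A small check like the following sketch can report all missing variables at once. The helper name `missing_env_vars` is hypothetical; the variable names are the ones used in this notebook.

```python
import os

# Environment variables this notebook reads via os.environ[...].
REQUIRED_VARS = ("OPENAI_API_KEY", "PINECONE_API_KEY", "PINECONE_API_ENV")


def missing_env_vars(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset or empty (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


missing = missing_env_vars()
if missing:
    print("Set these environment variables before running the notebook: "
          + ", ".join(missing))
```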

In [24]:
index_name = "langchain1" # use the name of the Pinecone index you have created here
namespace = "book"

docsearch = Pinecone.from_texts(
  [t.page_content for t in texts], embeddings,
  index_name=index_name, namespace=namespace)

### Ask Questions

Now that the index is built, we can query it: we retrieve the chunks most similar to the question and pass them to the LLM to get our answer back.

In [25]:
llm = OpenAI(temperature=0., openai_api_key=os.environ["OPENAI_API_KEY"])
chain = load_qa_chain(llm, chain_type="stuff")

In [26]:
query = "what is the definition of the Trading Desk?"

In [27]:
docs = docsearch.similarity_search(query,
  include_metadata=True, namespace=namespace)

chain.run(input_documents=docs, question=query)

' A trading desk for the purposes of the regulatory capital framework is an unambiguously defined group of traders or trading accounts. Each individual trader or trading account must be assigned to only one trading desk. The desk must have a clear reporting line to senior management and must have a clear and formal compensation policy linked to its pre-established objectives.'
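
The `"stuff"` chain type used above places all retrieved chunks into a single prompt together with the question. The sketch below illustrates that idea only: the function `build_stuff_prompt` is hypothetical, the prompt wording is an assumption rather than LangChain's actual template, and the two chunks are shortened excerpts of the answer shown above.

```python
def build_stuff_prompt(chunks, question):
    """Concatenate all chunks into one context block, then append the question.

    Hypothetical helper; the wording is illustrative, not LangChain's template.
    """
    context = "\n\n".join(chunks)
    return (
        "Use the following pieces of context to answer the question at the end. "
        "If you don't know the answer, say so.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Helpful Answer:"
    )


retrieved = [
    "A trading desk is an unambiguously defined group of traders or trading accounts.",
    "Each individual trader or trading account must be assigned to only one trading desk.",
]
prompt = build_stuff_prompt(retrieved, "what is the definition of the Trading Desk?")
print(prompt)
```

Because every retrieved chunk goes into one prompt, this approach only works while the chunks together fit in the model's context window, which is one more reason to keep the chunks small.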